From: Deveshi Dwivedi <deveshigurgaon@gmail.com>
To: git@vger.kernel.org
Cc: christian.couder@gmail.com, karthik nayak <karthik.188@gmail.com>,
	jltobler@gmail.com,  Ayush Chandekar <ayu.chandekar@gmail.com>,
	Siddharth Asthana <siddharthasthana31@gmail.com>,
	 Chandra Pratap <chandrapratap3519@gmail.com>
Subject: [GSOC][RFC] Draft Proposal: Complete and extend the remote-object-info command for git cat-file
Date: Tue, 17 Mar 2026 01:28:48 +0530
Message-ID: <CAG7UgEQTPhxPeEYkm44+BuSj5GG6PWhRrqGT7Vq7zXFPKZqoag@mail.gmail.com>

Hi! I would be grateful to get feedback on my proposal draft for GSoC
2026. Thank you very much!
-----------------------------------------------------------------------

Complete and extend the remote-object-info command for git cat-file

-----------------------------------------------------------------------

Personal Information

Name: Deveshi Dwivedi
Email: deveshigurgaon@gmail.com
GitHub: https://github.com/deveshidwivedi
Time Zone: UTC +5:30 (IST)
Education: Final year, IIIT Jabalpur
Blog: https://deveshidwivedi.github.io

-----------------------------------------------------------------------

Past experience with Open Source

I have been contributing to open source for the past few years. My
first contribution was during my second year of college. Since then I
have tried contributing to different projects to get better at
navigating unfamiliar codebases and understanding how real-world
projects are maintained. Some of my contributions include:

- https://github.com/processing/p5.js-web-editor/pull/3492

- https://github.com/neovim/neovim/pull/33235

- https://github.com/kube-vip/kube-vip/pull/1087

- https://github.com/WasmEdge/WasmEdge/pull/3963

- https://github.com/openfoodfacts/openfoodfacts-server/pull/10037

- https://github.com/openfoodfacts/openfoodfacts-server/pull/9967

- https://github.com/processing/p5.js/pull/6761

- https://github.com/processing/p5.js/pull/6669

I was also grateful to be an LFX mentee in Summer 2024 under the Open
Mainframe Project. During the program I worked on building a new
frontend for the Software Discovery Tool and integrating it with the
backend to make the tool easier to use.

-----------------------------------------------------------------------

My contributions so far

*t5403: use test_path_is_file instead of test -f
*Mailing list: https://lore.kernel.org/git/20251229185737.2328-1-deveshigurgaon@gmail.com/
*Status: merged in 'master'
*Description: Replace test -f with test_path_is_file helper in
post-checkout hook test for better failure diagnostics.

*t5403: improve post-checkout hook testing
*Mailing list: https://lore.kernel.org/git/20260112163643.231-1-deveshigurgaon@gmail.com/
*Status: merged in 'master'
*Description: Introduce check_post_checkout helper to eliminate
repetitive hook validation patterns, then switch to test_cmp for
clearer argument mismatch diagnostics.

*t1006: fix %(rest) test for object names with whitespace
*Mailing list: https://lore.kernel.org/git/20260219152407.12160-1-deveshigurgaon@gmail.com/
*Status: dropped
*Description: Submitted a patch to address a FIXME in t1006 around
%(rest) behavior with whitespace in object names. Junio explained that
whitespace-as-delimiter is documented behavior for %(rest), not a bug,
and Victoria clarified the FIXME was intentionally documenting a known
limitation, not requesting a fix. This taught me to read test comments
in their full historical context rather than treating them as isolated
tasks. It also introduced me to t1006-cat-file.sh in depth, which is
directly relevant to this project.

*avoid unnecessary strbuf_split*() and strbuf-by-value usage
*Mailing list: https://lore.kernel.org/git/20260311173336.8395-1-deveshigurgaon@gmail.com/
*Status: will merge to 'master'
*Description: Eliminate inefficient strbuf_split_str() in combine
filter parsing by using direct string traversal with strchrnul(), and
convert write_worktree_linking_files() to accept path strings instead
of strbuf-by-value parameters.

*coccinelle: detect and fix strbuf-by-value parameters
*Mailing list: https://lore.kernel.org/git/20260315094445.19849-1-deveshigurgaon@gmail.com/
*Status: queued
*Description: Add a Coccinelle semantic patch that detects functions
taking struct strbuf by value and transforms them to take pointer
parameters, and fix the remaining instance in stash.c.

-----------------------------------------------------------------------

Project Overview

This project completes and extends the remote-object-info subcommand
for git cat-file --batch-command, which allows clients to request
object metadata from a remote without downloading full object
contents.

Goal 1: Rebase and finalize Eric Ju's v11 patch series [1], address
the remaining review feedback, and get it merged.
Goal 2: Add %(objecttype) support to the object-info protocol, end-to-end.

-----------------------------------------------------------------------

Proposed Solution

----------

Goal 1: Complete v11 Series

----------

Pre-GSoC Analysis:

I rebased Eric Ju's v11 series [1] onto the current master. There were
conflicts in t/t1006-cat-file.sh, fetch-pack.c, Makefile and
object-file.c. After resolving those and building, I ran the test
suite.

While running t/t1017-cat-file-remote-object-info.sh [2], the first
test failed with "ambiguous redirect". '$daemon_parent' expands to the
trash directory path which contains a space and the redirect is not
quoted:

echo_without_newline "$hello_content" > $daemon_parent/hello

The shell splits the expanded path at the space and cannot tell which
file to redirect to. I grepped for other unquoted uses and found the
same problem with $HTTPD_DOCUMENT_ROOT_PATH/http_parent/hello in the
http test section. Quoting both instances fixes the failure.

In his review of Calvin Wan's initial remote-object-info
implementation [3], Jonathan Tan observed that the remote-object-info
state was stored in static globals rather than in the shared command
data structure. That approach made it difficult to support mixing
commands in a batch session. The v11 series addressed most of
this by restructuring the code so that remote-object-info now goes
through the same expand_data path used by info. However, one instance
of shared state mutation still remains.

In get_remote_info(), when no explicit format is given, currently the code does:

if (!opt->format)
    opt->format = "%(objectname) %(objectsize)";

The problem is that opt->format is shared across all commands in the
batch session. batch_objects() creates a single expand_data structure,
which every command uses. Mutating opt->format here permanently
replaces the original NULL value, so every later command in the
session sees the remote default instead.

Fix: Instead of modifying the shared state, we can use a local
variable in parse_cmd_remote_object_info():

const char *remote_format = opt->format
    ? opt->format
    : "%(objectname) %(objectsize)";

and pass remote_format to get_remote_info() for validation. This fix
is needed regardless of Goal 2. Even when the values happen to match,
mutating shared state from a command handler is incorrect. Once
%(objecttype) support is added, the special-case default disappears
entirely, and both local and remote commands can simply use
DEFAULT_FORMAT.
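
A minimal, self-contained sketch of this fix follows. The struct and
function names ending in _sketch, and remote_format_for(), are
hypothetical stand-ins, not the actual cat-file code. The point is only
that the default is chosen per command and the shared options are never
written to.

```c
#include <stddef.h>

/* Hypothetical stand-in for the options shared by the whole session. */
struct batch_options_sketch {
	const char *format; /* NULL means no explicit format was given */
};

/* The remote default quoted above; the macro name is an assumption. */
#define REMOTE_DEFAULT_FORMAT_SKETCH "%(objectname) %(objectsize)"

/*
 * Choose the format for one remote-object-info command. The options
 * are taken as const, so the shared state cannot be mutated here and
 * later commands still see the original (possibly NULL) value.
 */
static const char *remote_format_for(const struct batch_options_sketch *opt)
{
	return opt->format ? opt->format : REMOTE_DEFAULT_FORMAT_SKETCH;
}
```

Calling remote_format_for() with a NULL format yields the remote
default while opt->format itself stays NULL for the next command.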

----------

Review Feedback Analysis

Below are the main issues raised during the v11 review of the
remote-object-info patch series and how I plan to address them:

Issue 1: Format Validation Segfault [4]
The current validation uses a strstr() check to ensure that the format
contains %(objectsize). This is not sufficient. A format like:

%(objecttype) %(objectsize)

passes the check but later causes a segfault.
The crash occurs in expand_atom():

strbuf_addstr(sb, type_name(data->type));

The call chain looks like this:

batch_objects_command()
  → parse_cmd_remote_object_info()
    → get_remote_info()           ← validation belongs here
      → transport_fetch_refs()
    → batch_object_write()
      → expand_format()
        → expand_atom()           ← segfault here

The problem is that data->type may never be initialized. When it
remains OBJ_NONE (0), type_name(0) returns NULL because
object_type_strings[0] is NULL. That NULL is then passed to
strbuf_addstr(), which dereferences it and segfaults.
Jeff King pointed this out during review [4]. While experimenting with
the feature, he ran:

git cat-file --batch-command='%(objecttype) %(objectsize)'

and fed it a remote-object-info request, which triggered the crash. I
was able to reproduce the same behavior locally using a simple
client/server setup and a blob from a test repository.

Fix: Instead of relying on a strstr() check, the validation should
determine which atoms were actually requested by the format. During
the mark-query phase, expand_format() records requested atoms by
populating the corresponding fields in data.info. After this stage we
can inspect those fields to see exactly which atoms were requested.

If the format asks for something remote-object-info cannot provide,
the command should exit with an error that names the unsupported atom.
The format is defined when the batch session starts, so requesting an
unsupported atom is a configuration error rather than a per-object
condition. Returning empty output would be misleading, since a caller
would not be able to tell whether the object actually lacks that
attribute or the protocol simply does not support it. Failing early
with a clear error avoids silently producing incorrect results.
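
The idea can be sketched as below, under loud assumptions: the flag
struct stands in for the data.info fields that the mark-query pass
fills in, and mark_query_sketch() is a toy scanner, not Git's
expand_format(). It shows why inspecting the recorded atoms catches
"%(objecttype) %(objectsize)" while a substring check for
%(objectsize) lets it through.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flags standing in for what mark-query records. */
struct requested_atoms_sketch {
	unsigned name : 1;
	unsigned size : 1;
	unsigned type : 1;
	unsigned other : 1; /* any atom this sketch does not model */
};

/* Toy mark-query pass: record which %(atom) placeholders appear. */
static void mark_query_sketch(const char *format,
			      struct requested_atoms_sketch *req)
{
	const char *p = format;

	memset(req, 0, sizeof(*req));
	while ((p = strstr(p, "%(")) != NULL) {
		const char *start = p + 2;
		const char *end = strchr(start, ')');
		size_t len;

		if (!end)
			break; /* unterminated atom; real code would reject it */
		len = (size_t)(end - start);
		if (len == 10 && !strncmp(start, "objectname", len))
			req->name = 1;
		else if (len == 10 && !strncmp(start, "objectsize", len))
			req->size = 1;
		else if (len == 10 && !strncmp(start, "objecttype", len))
			req->type = 1;
		else
			req->other = 1;
		p = end + 1;
	}
}

/*
 * Return the name of the first requested atom that the remote side
 * cannot serve (before Goal 2 adds type), or NULL if the format is fine.
 */
static const char *
unsupported_remote_atom(const struct requested_atoms_sketch *req)
{
	if (req->type)
		return "objecttype";
	if (req->other)
		return "(unmodelled atom)";
	return NULL;
}
```

A caller would run the scan once when the session starts and exit with
an error naming the returned atom if the result is not NULL.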

Implementing this validation requires passing the expand_data instance
down to get_remote_info(). In v11 the function currently has the
signature:

static int get_remote_info(struct batch_options *opt, int argc,
                           const char **argv)

For this check to work, the expand_data pointer needs to be threaded
through from parse_cmd_remote_object_info(). This ends up being the
same signature change required for the format-mutation fix described
earlier, so both fixes can share the same small refactor.

As an additional safety measure, a defensive guard can be added in
expand_atom() so that it cannot segfault even if validation is
bypassed in a future code path.

Finally, the EXPAND_DATA_INIT macro currently initializes only .mode =
S_IFINVALID. It should also initialize .type = OBJ_BAD. Since OBJ_BAD
is -1, outside the bounds of the object_type_strings array,
type_name() will return NULL, which is then safely handled by the
guard above.
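
The lookup and the guard can be illustrated with a simplified model. It
assumes the type-name table behaves as described above (slot 0 is NULL,
out-of-range values return NULL); every identifier ending in _sketch,
and type_name_checked(), is made up for this example.

```c
#include <stddef.h>

enum object_type_sketch {
	SK_OBJ_BAD = -1,
	SK_OBJ_NONE = 0,
	SK_OBJ_COMMIT = 1,
	SK_OBJ_TREE = 2,
	SK_OBJ_BLOB = 3,
	SK_OBJ_TAG = 4
};

/* Slot 0 (the "none" type) deliberately has no name. */
static const char *type_strings_sketch[] = {
	NULL, "commit", "tree", "blob", "tag"
};

/*
 * -1 converted to unsigned is far past the end of the table, so both
 * the "bad" and the "none" values end up returning NULL.
 */
static const char *type_name_sketch(unsigned int type)
{
	size_t count = sizeof(type_strings_sketch) /
		       sizeof(type_strings_sketch[0]);

	if (type >= count)
		return NULL;
	return type_strings_sketch[type];
}

/*
 * Defensive guard: report failure instead of handing a NULL string to
 * the caller. Returns 0 and sets *out on success, -1 otherwise.
 */
static int type_name_checked(enum object_type_sketch type, const char **out)
{
	const char *name = type_name_sketch((unsigned int)type);

	if (!name)
		return -1;
	*out = name;
	return 0;
}
```

With a guard of this shape, an uninitialized or invalid type turns into
an error return rather than a NULL pointer reaching a string-append
call.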

Issue 2: Misleading Input Overflow Error [5]

The current overflow check reports that the command contains too many
objects, but that may not be the real cause. For example, a very long
repository URL could exceed the line length limit and trigger the same
error. Junio pointed this out in his review [5].

Fix: We can handle this with two separate validation steps.

First, we can check the line length before parsing and report any
overflow accurately. After parsing, we can validate the number of
requested objects. Malformed quoting should also be caught during
parsing. Separating these checks ensures that error messages point to
the actual problem, rather than incorrectly blaming too many objects.
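
A rough sketch of the two-step check is below. The limits, the enum,
and check_remote_line() are hypothetical, and quoting is not modelled
at all: words are split on single spaces only. It is meant to show the
ordering of the checks, not the real parser.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LINE_LEN_SKETCH 64 /* hypothetical line length limit */
#define MAX_OBJECTS_SKETCH 3   /* hypothetical object count limit */

enum input_status_sketch {
	INPUT_OK,
	INPUT_LINE_TOO_LONG,
	INPUT_TOO_MANY_OBJECTS,
	INPUT_MISSING_ARGS
};

/* Validate a "<remote> <oid>..." argument line in two separate steps. */
static enum input_status_sketch check_remote_line(const char *line)
{
	size_t words = 0;
	int in_word = 0;
	const char *p;

	/* Step 1: length check before any parsing happens. */
	if (strlen(line) > MAX_LINE_LEN_SKETCH)
		return INPUT_LINE_TOO_LONG;

	/* Step 2: count space-separated words, then check the object count. */
	for (p = line; *p; p++) {
		if (*p == ' ') {
			in_word = 0;
		} else if (!in_word) {
			in_word = 1;
			words++;
		}
	}
	if (words < 2)
		return INPUT_MISSING_ARGS; /* need a remote and one object */
	if (words - 1 > MAX_OBJECTS_SKETCH)
		return INPUT_TOO_MANY_OBJECTS;
	return INPUT_OK;
}
```

Because each failure has its own status, a long URL is reported as a
length problem and never as "too many objects".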

Issue 3: Code Style and State Management

The patch series also introduces a few style inconsistencies that
should be cleaned up:
- multi-line comment formatting
- missing blank lines between #define groups
- long macro definitions
- mixing size_t and int for loop counters

In addition, parse_cmd_remote_object_info() should reset all fields in
expand_data that it modifies before returning. The v11 implementation
already resets data->skip_object_info = 0 on both normal and error
paths, but it does not reset data->type or data->size. Resetting these
fields avoids leaking stale remote state into subsequent commands.

data->skip_object_info = 0; /* already in v11 */
data->type = OBJ_BAD;
data->size = 0;

Without these resets, a batch session that runs remote-object-info
followed by a local info command could produce incorrect output. If
odb_read_object_info_extended() fails for the local object, the
previously populated remote values may still be present in data,
causing stale data to be printed. It is also important that the
data->skip_object_info = 0 reset happens even on the goto cleanup
error path so that the state is fully restored before returning.
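
The cleanup pattern can be sketched as follows. The struct is a
hypothetical subset of expand_data, fetch_remote_sketch() is a stub
that stands in for the remote query, and OBJ_BAD_SKETCH mirrors the -1
value mentioned above. What matters is that the single cleanup label is
reached on both the success and the error path.

```c
#include <stddef.h>

#define OBJ_BAD_SKETCH (-1)

/* Hypothetical subset of the shared per-session data. */
struct expand_data_sketch {
	int type;
	unsigned long size;
	int skip_object_info;
};

/* Stub for the remote query; fails on request. */
static int fetch_remote_sketch(struct expand_data_sketch *data, int fail)
{
	if (fail)
		return -1;
	data->type = 3;    /* pretend the remote reported a blob */
	data->size = 1234; /* and this size */
	return 0;
}

static int run_remote_command_sketch(struct expand_data_sketch *data,
				     int fail)
{
	int ret = 0;

	data->skip_object_info = 1;
	if (fetch_remote_sketch(data, fail) < 0) {
		ret = -1;
		goto cleanup;
	}
	/* ... the formatted output would be written here ... */

cleanup:
	/* Reached on both paths: leave no remote state behind. */
	data->skip_object_info = 0;
	data->type = OBJ_BAD_SKETCH;
	data->size = 0;
	return ret;
}
```

After either outcome, a following local info command starts from a
clean slate and cannot print leftover remote values.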

New tests to be added for v12:
- %(objecttype) %(objectsize) format: command dies cleanly instead of
segfaulting
- %(objecttype) alone: command dies with a clear error
- %(objectname) only: works without requesting size
- Mixed remote-object-info and info commands in batch mode: both use
the correct default formats (this also catches the format-mutation
issue)

----------
Goal 2: Add support for %(objecttype)
----------
Server Side

struct requested_info in protocol-caps.c is extended to include
unsigned type : 1, alongside the existing unsigned size : 1. The
capability parser in cap_object_info() is updated to recognize type
requests using the same pattern that is already used for size. The
server-side response logic in send_info() is then updated to include
the type when it has been requested.

One useful optimization here is that odb_read_object_info() already
provides the object type as its return value, while the object size is
returned through an output parameter. The current implementation in
send_info() calls this function but discards the return value after
checking whether it is negative. When both size and type are
requested, we can obtain both pieces of information from a single
call. If only type is requested, the call simply passes NULL for the
sizep parameter.

For loose objects, both the type and size are stored in the same
object header ("<type> <size>\0"). For packed objects, the type is
already in the pack entry header, so retrieving it is free.

When sending responses, send_info() includes both size and type in the
headers if requested. Each object line looks like:

<oid> <size> <type>

If the server cannot resolve an attribute, that field is left blank.
Behavior for missing values remains consistent with existing handling.
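
As an illustration of the line layout only, here is a small sketch. The
struct mirrors the two bit-fields named above, but
format_info_line(), append_sketch(), and the choice to keep the
separating space for a blank field are assumptions of this example, not
the wire format of v11.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct requested_info_sketch {
	unsigned size : 1;
	unsigned type : 1;
};

/* Append s at offset off if it fits; returns the new offset. */
static size_t append_sketch(char *buf, size_t buflen, size_t off,
			    const char *s)
{
	size_t len = strlen(s);

	if (off + len + 1 > buflen)
		return off; /* sketch only: drop text that does not fit */
	memcpy(buf + off, s, len + 1);
	return off + len;
}

/*
 * Build "<oid>[ <size>][ <type>]". When the object was not found, the
 * requested fields stay blank but keep their separating space.
 */
static void format_info_line(char *buf, size_t buflen,
			     const struct requested_info_sketch *req,
			     const char *oid_hex, int found,
			     unsigned long size, const char *type)
{
	char num[32];
	size_t off = 0;

	if (!buflen)
		return;
	buf[0] = '\0';
	off = append_sketch(buf, buflen, off, oid_hex);
	if (req->size) {
		off = append_sketch(buf, buflen, off, " ");
		if (found) {
			snprintf(num, sizeof(num), "%lu", size);
			off = append_sketch(buf, buflen, off, num);
		}
	}
	if (req->type) {
		off = append_sketch(buf, buflen, off, " ");
		if (found && type)
			append_sketch(buf, buflen, off, type);
	}
}
```

With both attributes requested, a found blob of 1234 bytes is rendered
as "<oid> 1234 blob"; with only type requested, the size column is
omitted entirely.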

On the server side, object_info_advertise() in serve.c no longer marks
its struct strbuf *value as UNUSED and now populates it with "size
type". This means the server advertises:

object-info=size type

during capability negotiation. Older clients ignore the value string
per protocol v2 rules, and the server_supports_v2("object-info") check
continues to work, so backward compatibility is maintained.

----------

Client Transport

Before requesting type, the client checks whether the server supports
it using server_supports_feature("object-info", "type", 0). This looks
at the capability value and parses it with parse_feature_request(). If
the server advertises only object-info=size, the check returns false
for type. In that case, if the format requires %(objecttype), the
client exits with a clear error. When building the request,
unsorted_string_list_has_string() is used instead of strstr() to avoid
substring matches.
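
The difference between a substring test and a whole-token test can be
shown with a short sketch. value_has_feature_sketch() is a hypothetical
helper written for this example; it is not server_supports_feature() or
parse_feature_request(), and it assumes the capability value is a
space-separated list such as "size type".

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the space-separated capability value contains feature
 * as a complete token, 0 otherwise.
 */
static int value_has_feature_sketch(const char *value, const char *feature)
{
	size_t flen = strlen(feature);
	const char *p = value;

	if (!flen)
		return 0;
	while (*p) {
		size_t len = strcspn(p, " "); /* length of the current token */

		if (len == flen && !strncmp(p, feature, flen))
			return 1;
		p += len;
		p += strspn(p, " "); /* skip the separator(s) */
	}
	return 0;
}
```

For the value "size objecttype", strstr() would find "type" inside
"objecttype", while the token comparison above correctly reports that
"type" is not advertised.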

On the response side, the client keeps track of column positions using
size_index and type_index, both initialized to -1. The attribute
headers sent by the server determine which columns appear and in what
order. The data lines are then parsed using those indices with bounds
checks. Since column 0 is always the OID, the indices use a +1 offset.
For example: in <oid> 1234 blob, column 1 contains the size and column
2 contains the type. If fewer columns are returned than expected, the
bounds checks prevent out-of-range access.
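
A sketch of the index bookkeeping is below. It assumes the attribute
header arrives as one space-separated line (for example "size type")
and that data lines are split on single spaces; the helper names and
MAX_COLS_SKETCH are invented for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_COLS_SKETCH 8

struct column_map_sketch {
	int size_index; /* column in a data line, or -1 if absent */
	int type_index;
};

/* Split line on single spaces in place; returns the field count. */
static int split_fields_sketch(char *line, char **fields, int max)
{
	int n = 0;
	char *p = line;

	while (n < max) {
		fields[n++] = p;
		p = strchr(p, ' ');
		if (!p)
			break;
		*p++ = '\0';
	}
	return n;
}

/* Derive column positions from the attribute header line. */
static void map_columns_sketch(char *header, struct column_map_sketch *map)
{
	char *attrs[MAX_COLS_SKETCH];
	int i, n = split_fields_sketch(header, attrs, MAX_COLS_SKETCH);

	map->size_index = -1;
	map->type_index = -1;
	for (i = 0; i < n; i++) {
		/* Column 0 of every data line is the OID, hence the +1. */
		if (!strcmp(attrs[i], "size"))
			map->size_index = i + 1;
		else if (!strcmp(attrs[i], "type"))
			map->type_index = i + 1;
	}
}

/* Returns 0 on success, -1 if an expected column is missing. */
static int parse_data_line_sketch(char *line,
				  const struct column_map_sketch *map,
				  unsigned long *size, const char **type)
{
	char *cols[MAX_COLS_SKETCH];
	int n = split_fields_sketch(line, cols, MAX_COLS_SKETCH);

	if (map->size_index >= 0) {
		if (map->size_index >= n)
			return -1; /* bounds check: line is too short */
		*size = strtoul(cols[map->size_index], NULL, 10);
	}
	if (map->type_index >= 0) {
		if (map->type_index >= n)
			return -1;
		*type = cols[map->type_index];
	}
	return 0;
}
```

Since the indices come from the header rather than from a fixed
layout, the same parsing code handles a server that sends the
attributes in a different order.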

----------

Memory: I will allocate typep per OID the same way v11 already does
for sizep; free_object_info_contents() handles cleanup.

----------

cat-file integration

In get_remote_info(), the format string determines which attributes
are requested from the server. Previously, if %(objectsize) appeared
in the format, "size" was added to object_info_options. With this
change, %(objecttype) similarly adds "type".

Since %(objecttype) is now supported, the earlier allow-list
validation that rejected data->info.typep is removed.

Supporting %(objecttype) also allows the removal of the special-case
default format in get_remote_info(). Both local and remote commands
can now use DEFAULT_FORMAT (%(objectname) %(objecttype)
%(objectsize)), eliminating the previous mismatch in default output.

----------

Backward Compatibility:

A new client with a new server supports both size and type. With a new
client and an old server, server_supports_feature() returns false for
type, and the client exits with a clear error if the format requests
%(objecttype). Size-only requests still work. Old clients work with
any server. They ignore the new type capability and only request the
attributes they understand, so existing workflows continue to work as
before.

----------

Testing for Goal 2

Server-side (t/t5701-git-serve.sh):

- Server advertises object-info=size type
- Correct type strings for all four object types
- Combined size + type and type-only requests

Client-side (t/t1017-cat-file-remote-object-info.sh):

- %(objecttype) across git://, file://, http://
- Default format includes type after unification
- Server that only supports size: clean error for %(objecttype)
- Mixed local + remote in buffer mode (state isolation)

----------

Stretch Goals (if time permits)

If Goal 1 and Goal 2 land ahead of schedule, %(objectsize:disk) could
be explored. The server infrastructure already exists via
odb_read_object_info_extended() and the implementation pattern is
identical to %(objecttype). %(deltabase) is a similar extension. Both
depend on server pack format rather than intrinsic object properties,
so either would need mailing list consensus before proceeding.

-----------------------------------------------------------------------

Project Timeline

I have intentionally allocated slightly longer phase intervals to
provide a buffer. In practice, each task may take less time, but this
ensures there is room to handle unexpected delays without affecting
the overall schedule.

Pre-GSoC (Until May 1):
- Continue exploring the codebase.
- Stay engaged with the community and follow discussions.

Community Bonding (May 1 - 25):
- Study the codebase and internals in more depth.
- Review all v11 feedback threads.
- Identify rebase conflicts.
- Discuss protocol design with mentors on the mailing list.

Phase 1: Rebase and Fix (May 26 - Jun 15):
- Rebase v11 onto master.
- Fix all bugs: format validation, input validation, format mutation,
state cleanup, code style, test quoting.
- Add new tests.
- Send v12 to the mailing list.

Phase 2 (Jun 16 - Jul 6):
- Iterate on v12 review feedback.
- Begin server-side type implementation.
- Add server tests, send server patches.

Midterm (Jul 10):
- Goal 1 in final review or merged.
- Server patches posted.

Phase 3: Client and Integration (Jul 14 - Aug 10):
- Iterate on server patches.
- Implement client transport and cat-file integration.

Phase 4: Final (Aug 11 - 24):
- Final review iteration.
- Buffer for unexpected issues.
- Ensure all patches are in the review pipeline.

Final Evaluation (Aug 25 - 31):
- Address any remaining review feedback.

-----------------------------------------------------------------------

Availability

The project size is 350 hours (medium). I plan to dedicate around 35
hours per week during the 12-week coding period to work on the
project. I do not anticipate any major conflicts during this time and
will be able to stay actively engaged with development and
discussions.

-----------------------------------------------------------------------

Post GSoC

I would like to stay active in the Git community even after GSoC.
There is still a lot for me to learn from the project and the
community, and I hope to continue contributing and improving my
understanding of Git’s internals.

-----------------------------------------------------------------------

References

[1] https://lore.kernel.org/git/20250221190451.12536-1-eric.peijian@gmail.com/

[2] https://lore.kernel.org/git/20240628190503.67389-7-eric.peijian@gmail.com/

[3] https://lore.kernel.org/git/20220504212738.162853-1-jonathantanmy@google.com/

[4] https://lore.kernel.org/git/20240628190503.67389-1-eric.peijian@gmail.com/t/#md20501dc269cc38ac1ac8cf7599281b937b651a0

[5] https://lore.kernel.org/git/20240628190503.67389-1-eric.peijian@gmail.com/t/#mbe53f476d6cd32633277c28f17f8b6a59316b1db

--
Thank you,
Deveshi

