Skip to content

Conversation

@a-feld
Copy link

@a-feld a-feld commented Aug 11, 2019

pip installs that use auth with version control systems appear to be leaking passwords in the logs.

A partial fix appears to have been applied in #5773 but the passwords are still leaking due to subprocess_cmd printing the command description which includes the password in use.

This PR attempts to scrub the passwords out of the logs by using a new HiddenText class to hide the raw url contents in a raw attribute until explicitly needed. The subprocess_cmd function has been modified to extract the raw attribute when calling the underlying command.

@a-feld a-feld marked this pull request as ready for review August 11, 2019 22:21
@cjerdonek
Copy link
Member

Thanks for filing this! I looked at the PR, and I think there is a cleaner way we can be handling this. Making every caller have to construct and pass command_desc and also track the url_idx seems more verbose and error-prone to me.

I'm thinking a simpler solution could involve defining a simple str wrapper class whose __str__() method defaults to the redacted form. That way it can be used in log formatting as is, and there is less chance of accidentally exposing it somewhere. The original, unredacted form could be accessed via an attribute on the class instance (e.g. raw), so it's opt-in for the sensitive form.

The wrapper class could be something like this (the name is just a suggestion, maybe HiddenText):

class SensitiveText(object):  

    def __init__(self, raw, redacted):
        self.raw = raw
        self.redacted = redacted
    
    def __str__(self):
        return self.redacted

And then we could have helper functions as needed so callers won't need to call SensitiveText themselves. Something like (again, names are just suggestions):

def hide_url(url):
    redacted = redact_password_from_url(url)
    return SensitiveText(url, redacted=redacted)

hidden_url = hide_url(...)

and the even simpler:

def hide_password(password):
    return SensitiveText(password, redacted='****')

Callers could "hide" variables as soon as we know about them (though for now, we could just restrict this use to the vcs modules).

Then format_command_args() could do an instance check for SensitiveText when constructing the command description command_desc:

def format_command_args(args):
# type: (List[str]) -> str
"""
Format command arguments for display.
"""
return ' '.join(shlex_quote(arg) for arg in args)

And of course, we'd also want to convert these objects to str when actually invoking subprocess.

@chrahunt
Copy link
Member

@cjerdonek, that would also be relevant to #6834.

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

@cjerdonek Thanks for the feedback! I agree, the current approach in this PR feels pretty fragile. I'll start working on incorporating those suggestions!

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

@cjerdonek I've pushed up some edits, things are looking much cleaner code-wise now! 😄
I haven't fully had the opportunity to vet these changes - I'm going to let CI run to see how much damage has been done but I figured in the meantime you might want to have a look 👀

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My instinct is that this class can be made a lot simpler. Like, couldn't it work just to do this? (no need for __new__, __init__, __eq__, and __hash__)

class HiddenText(str):

    def __repr__(self):
        return '<HiddenText {!r}>'.format(str(self))

    def __str__(self):
        return self.redacted

    @property
    def raw(self):
        return str.__str__(self)

Then "public" creation functions can be made like this, for example:

def hide_password(password):
    hidden = HiddenText(password)
    hidden.redacted = '****'

    return hidden

a = hide_password('*secret*')

print(str(a))
print(repr(a))
print(a.raw)

which gives:

****
<HiddenText '****'>
*secret*

I think decoupling the creation functions from the class will be better because it keeps the class simpler. Also, after giving it some thought, I think HiddenText is a better name in part because it's simpler to type and say.

By the way, your __hash__() implementation below I don't think is right because it bases the hash off of the redacted form, which can be the same for different values. The above, on the other hand, does the right thing because it inherits what str does, which is to use the underlying string value.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the hash off of the redacted form

That's actually intentional. The idea is that if the redacted form is stored in a hashable location, the redacted form is likely what is being operated upon. The example that I ran into was the urlparse cache where the redacted url was being split. In that case (and I think all cases), the redacted form was what was being cached.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the __eq__ is necessary for proper comparison specifically during tests. I was trying to avoid rewriting all of the assertions which compare actual values to expected values but I'm open to rewriting those tests as well if that's a better approach.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the eq is necessary for proper comparison specifically during tests.

Can you give me an example of some tests that would be affected by what I suggested?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This went back to when I was patching the Link model but at that point there were a few tests that were failing when comparing the actual url (HiddenText) against the expected url (str).

Here were a couple of examples:

def test_expand_missing_env_variables(self, tmpdir, finder):

def test_expand_existing_env_variables(self, tmpdir, finder):

Since I'm no longer updating the Link model, these tests wouldn't be impacted but if the intent is to move the HiddenText cast back into the Link model eventually this kind of issue will come up. At that point, it's whether the tests should be updated or the __eq__ comparison logic.

I defer to your preferences here 😄

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

@cjerdonek I've pushed simplifications and renames to the HiddenText class. My concerns with the PR are that I've edited the Link model to replace the url attribute with a HiddenText type. I'm not familiar enough with the overall architecture of pip to tell if I've actually broken something by doing this. Do you have any concerns with modifying the Link class? Should I move the HiddenText cast closer in to the versioncontrol logic?

@cjerdonek
Copy link
Member

cjerdonek commented Aug 12, 2019 via email

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

Scoping down the PR sounds like a good plan 😄
I've moved the HiddenText logic closer in to the version control logic!

@cjerdonek
Copy link
Member

I've pushed simplifications and renames to the HiddenText class.

I still see the class methods..

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

I still see the class methods.

Sorry, rereading the suggestion posted, I'm not entirely clear on the suggestion. Is the suggestion to remove the class methods entirely and replace with helper functions? Is there an advantage to having functions here over class methods? I'll go with whichever you suggest I'm just curious to learn more about the rationale.

@cjerdonek
Copy link
Member

Yes, the suggestion was to have the creation functions be separate from the class. This keeps the class simpler so it doesn't e.g. have to know about passwords, parsing URL's, etc. Also, if we later want to add something like a HiddenUrl class, we don't have to go around and change every call site. It's just in the one helper function.

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

@cjerdonek Got it, I also see what you were saying about the simplification. I'm making edits to be more in-line with your comment now 😄

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

@cjerdonek It doesn't look like the HiddenText class suggested here works. I'm not quite sure how to proceed. There are a couple of ways I can get this to work:

  1. I can use the HiddenText class with the redacted value as the string and the raw value as an additional attribute. The downside to this is the eq behavior may have to be changed, depending on how we want to handle things.
  2. I can remove the string subclass and pass the raw / redacted explicitly but this has the downside of not being able to work in areas that require strings or byte strings (like regex).
  3. Possibly other ways 😄
@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

@cjerdonek I've gone ahead with option 1 and left out all of the __eq__ overrides / etc for now but the tests are failing as a result. The tests will have to be updated or the class will need an __eq__ override to pass. Whichever is preferred I'll implement.

@cjerdonek
Copy link
Member

cjerdonek commented Aug 12, 2019 via email

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

Can you say why it didn’t work for you?

When I used the class given, I observed the passwords getting leaked into the logs. This was due to the call to format_command_args which was ultimately using the raw value to formulate the description.

@cjerdonek
Copy link
Member

When I used the class given, I observed the passwords getting leaked into the logs. This was due to the call to format_command_args which was ultimately using the raw value to formulate the description.

This is why I said this in my first comment:

Then format_command_args() could do an instance check for SensitiveText when constructing the command description command_desc:

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

I worry that defaulting to operating on secrets is still error prone. Will there be future issues that occur from defaulting to the use of the raw secret value for string operations? The current implementation is fail-secure, the raw value must be explicitly accessed. I can certainly see advantages / disadvantages either way, it does seem prudent to discuss though.

@cjerdonek
Copy link
Member

cjerdonek commented Aug 12, 2019

I would focus on getting something working that is simple and doesn't require too many changes. And we can always continue to refine. My suggestion in my first comment was "fail secure" in the sense that you're talking about, but I don't know what effect it would have on type-checking without trying it out myself. Personally, I'd rather we didn't subclass from str because it is inherently more complex and harder to reason about (and has many methods we have to consider, etc).

@a-feld
Copy link
Author

a-feld commented Aug 12, 2019

I pushed up an edit which does an isinstance check inside of format_command_args. Thanks for taking the time to help me out on this PR @cjerdonek !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can just be shlex_quote(str(arg)).

@cjerdonek
Copy link
Member

cjerdonek commented Aug 12, 2019

I pushed up an edit which does an isinstance check inside of format_command_args.

It looks like the isinstance() check isn't there anymore. The reason I would keep that check is because we can't rely on the type-checker 100%. Like, in Python 2, it's possible that some of the arguments might have unicode type (I recall one function annotated with str that can return a unicode path, but it has an ignore comment), and if there's a non-ascii character, then passing it to str() can cause a UnicodeEncodeError.

Also, I would add a comment explaining why the str() conversion and isinstance() check are being done -- because (1) the argument can have type HiddenText, and (2) for the reason I said above, respectively.

Copy link
Member

@cjerdonek cjerdonek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, as the PR is in better shape now, I've added some more detailed comments.

Also, I don't want to put more on your plate now that things look like they're working, but did you ever try the form of class I suggested in my original comment here? That has some advantages in being more explicit about what's being passed around. For example, it would let us distinguish more easily between HiddenText and str when using the the type-checker (and not get confused about whether something has been converted yet). It also addresses the "failure mode" issue you raised earlier.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this seems harmless, I think it would be better to keep things narrow for now and just keep to the minimal places in the vcs code. We can expand the use in later PR's as the pattern solidifies, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add type annotations to all the new methods and functions you're adding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to avoid adding more tests that depend on network access, IMO, as it adds flakiness, makes the tests slower, etc. It would be best if we can find the appropriate places at which to unit test to test this functionality (starting with the "inner-most" functions).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you have one too many Unions. Can't this just be List[Union[str, HiddenText]]?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy doesn't allow it because this function is in use by req_install.py and a few other files.

mypy appears to treat List[Union[str, HiddenText]] as incompatible with List[str].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, you might want to define a short-hand for List[Union[str, HiddenText]] then to make the type annotations more concise. CommandArgs? Then it can be Union[List[str], CommandArgs].

@a-feld
Copy link
Author

a-feld commented Aug 13, 2019

@cjerdonek Ah, I see! Sorry for the misunderstanding - I read the original feedback as there being a way to avoid intercepting all of the run_command points by intercepting the URL upstream using a special type based on this comment:

Callers could "hide" variables as soon as we know about them

I thought it was a cool idea and tried playing around with it! I've pushed an update going back to the original plan of patching all of the URL locations in each of the vcs modules.

@cjerdonek
Copy link
Member

I thought it was a cool idea and tried playing around with it!

I agree, and thanks for your efforts trying it out! :) I think we both got a taste for what that might involve. Since that "next step" would take more work and planning, etc. to make sure it all works out (as it's more extensive and potentially more sophisticated), I think it's better for now to save something like that for a later PR, and just get something simpler in place as a start to address the immediate issue..

@a-feld
Copy link
Author

a-feld commented Aug 13, 2019

@cjerdonek 👍 ! Any ideas on regression tests for this? If I remove the network marker from the tests submitted earlier in the PR, will those be ok? They do tend to run on the slower side.

Copy link
Member

@cjerdonek cjerdonek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One quick comment. (This isn't exhaustive. Just pointing out one thing for now.)

@a-feld
Copy link
Author

a-feld commented Aug 14, 2019

@cjerdonek OK I've pushed up an edit which annotates the inputs to the VCS methods and hides the url in versioncontrol.py

The only known deviation is that resolve_revision in git uses HiddenText for the url since a git fetch ... $url command is run which can leak passwords if not hidden.

Hopefully I've addressed all feedback 🙏

@cjerdonek
Copy link
Member

One thing that can be done to simplify the changes would be to define a function like this and put it in the same module that you define CommandArgs:

def cast_args(args):
    # type: (Union[List[str], List[HiddenText], CommandArgs]) -> CommandArgs
    new_args = []  # type: CommandArgs
    new_args.extend(args)

    return new_args

Then instead of having to break every list up into several lines, you can just convert the first element. So things like this:

cmd_args = (
    ['export'] + self.get_remote_call_options() +
    rev_options.to_args() + [url, location]
)

would just need to be changed to:

cmd_args = (
    cast_args(['export']) + self.get_remote_call_options() +
    rev_options.to_args() + [url, location]
)

Also, question: how many changes to the code were due just to adding the type annotations themselves and not related to HiddenText? I'm wondering if it would make sense to add type annotations to the vcs modules without HiddentText in a separate PR, so those code changes can be looked at independently.

@a-feld
Copy link
Author

a-feld commented Aug 14, 2019

Aha! I can use a type cast helper! Thanks for the tip, I'm not very practiced with python type annotations 😄

how many changes to the code were due just to adding the type annotations themselves and not related to HiddenText?

I tried to make all of the type annotation updates in a specific commit to hopefully ease some of the review pain
eff94f7

There should be a way in github to filter by specific commit when reviewing the diffs as well.

@cjerdonek
Copy link
Member

cjerdonek commented Aug 14, 2019 via email

@a-feld
Copy link
Author

a-feld commented Aug 14, 2019

Would you be open to filing a PR that adds type annotations to the vcs modules without any changes related to HiddenText?

If you're ok opening and driving that second PR, we can merge that one first and then I can rebase on top of those changes. To be honest I'm having trouble mentally tracking all of the changes flying around at the moment and I feel that driving another PR would add to my cognitive overhead.

I saw some changes that didn’t seem related to the hidden text stuff (e.g. around None values), so I’m assuming it was done to get the type checker to pass.

Yeah, that's correct. In particular the none check you're referring to is because the property returns an optional so the type checker was complaining about accessing an attribute on None-typed values.

@cjerdonek
Copy link
Member

cjerdonek commented Aug 14, 2019 via email

@a-feld
Copy link
Author

a-feld commented Aug 14, 2019

👍 I'll open a 2nd PR for the type annotations and put this one on hold for now.

@cjerdonek
Copy link
Member

I posted PR #6890 to finish the work started here, FYI.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Sep 16, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Sep 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

auto-locked Outdated issues that have been locked by automation

3 participants