WIP: Icechunk opener #1135
base: main
Conversation
|
I will automatically update this comment whenever this PR is modified
|
User jbusecke does not have permission to run integration tests. A maintainer must perform a security review of the code changes in this pull request and re-run the failed integration tests jobs, if the code is deemed safe. |
earthaccess/__init__.py
Outdated
)
from .auth import Auth
from .dmrpp_zarr import open_virtual_dataset, open_virtual_mfdataset
from .icechunk import _open_icechunk_from_url
Why is there an underscore prefix? This does not match up with the entry you added to __all__.
Ah thanks for catching this @chuckwondo. My intention here was to hack most of this together and develop it further after an initial test (with lots of hardcoded bits) passes.
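For context, the mismatch being pointed at looks roughly like this; a minimal sketch only, since the final public name is still undecided in this PR:

```python
# Illustrative sketch of keeping the import and __all__ consistent
# (names here are assumptions, not the PR's final API).
from .icechunk import open_icechunk_from_url  # public, no leading underscore

__all__ = [
    "open_virtual_dataset",
    "open_virtual_mfdataset",
    "open_icechunk_from_url",  # must match the imported name exactly
]
```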
earthaccess/icechunk.py
Outdated
# TODO: Figure out how to ensure authentication here.


def _get_daac_provider_from_url(url: str) -> str:
The return type is annotated incorrectly.
earthaccess/icechunk.py
Outdated
@@ -0,0 +1,109 @@
from datetime import datetime
from typing import Dict, List, Optional
Suggested change: remove `from typing import Dict, List, Optional`.
Co-authored-by: Chuck Daniels <cjdaniels4@gmail.com>
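On Python 3.9+ the typing aliases aren't needed; a quick sketch of what the annotations could look like with built-in generics (the function names below are illustrative, not code from this PR):

```python
# Built-in generics (PEP 585) and unions (PEP 604) replace Dict/List/Optional.
def build_bucket_mapping(urls: list[str], endpoint: str) -> dict[str, str]:
    """Hypothetical helper: map each bucket URL to a credentials endpoint."""
    return {url: endpoint for url in urls}


def lookup_endpoint(mapping: dict[str, str], bucket: str) -> str | None:
    """Return the endpoint for a bucket, or None if unknown."""
    return mapping.get(bucket)
```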
|
Thank you so much for looking into this @chuckwondo. |
|
Follow-up on a Slack discussion I had with @betolink and Ryan Abbott. The main thing I need for this functionality is a mapping from bucket (+ prefix) to a credentials endpoint. With the help of Claude I created a little script that crawls CMR to find this mapping and, crucially, to check whether the mapping between each bucket and credentials endpoint is unique:
Crawl Script v0
#!/usr/bin/env python3
"""
Async CMR Query - Map S3 Buckets to Auth Endpoints
Crawls NASA CMR API and maps buckets (no prefix) to S3 credentials endpoints.
Warns about buckets with conflicting endpoints.
"""
import asyncio
import aiohttp
import json
from typing import Dict, Set, Tuple
from collections import defaultdict
async def fetch_page(
session: aiohttp.ClientSession,
base_url: str,
page_num: int,
page_size: int,
cloud_hosted: bool = True
) -> Tuple[int, list, int]:
"""
Fetch a single page from CMR API.
Returns:
Tuple of (page_num, items, total_hits)
"""
params = {
"page_size": page_size,
"page_num": page_num
}
if cloud_hosted:
params["cloud_hosted"] = "true"
try:
async with session.get(base_url, params=params, timeout=aiohttp.ClientTimeout(total=30)) as response:
response.raise_for_status()
data = await response.json()
items = data.get("items", [])
total_hits = int(response.headers.get("CMR-Hits", 0))
return page_num, items, total_hits
except Exception as e:
print(f" ✗ Error fetching page {page_num}: {e}")
return page_num, [], 0
async def query_cmr_async(
base_url: str = "https://cmr.earthdata.nasa.gov/search/collections.umm_json",
cloud_hosted: bool = True,
max_pages: int = 100,
page_size: int = 100,
concurrent_requests: int = 10
) -> Dict[str, Dict]:
"""
Asynchronously query CMR API and collect DirectDistributionInformation.
Args:
base_url: CMR API endpoint
cloud_hosted: Filter for cloud-hosted collections
max_pages: Maximum number of pages to fetch
page_size: Results per page
concurrent_requests: Number of concurrent requests
Returns:
Dictionary mapping concept_id to DirectDistributionInformation
"""
print(f"Starting async CMR query...")
print(f" Max pages: {max_pages}")
print(f" Page size: {page_size}")
print(f" Concurrent requests: {concurrent_requests}")
print()
results = {}
async with aiohttp.ClientSession() as session:
# First, fetch page 1 to get total hits
_, items, total_hits = await fetch_page(session, base_url, 1, page_size, cloud_hosted)
if items:
for item in items:
concept_id = item.get("meta", {}).get("concept-id")
if concept_id:
direct_dist_info = item.get("umm", {}).get("DirectDistributionInformation")
if direct_dist_info:
results[concept_id] = direct_dist_info
print(f"Total collections available: {total_hits}")
total_pages = min(max_pages, (total_hits + page_size - 1) // page_size)
print(f"Will fetch {total_pages} page(s)\n")
if total_pages <= 1:
return results
# Fetch remaining pages concurrently
tasks = []
for page_num in range(2, total_pages + 1):
task = fetch_page(session, base_url, page_num, page_size, cloud_hosted)
tasks.append(task)
# Process in batches to limit concurrency
if len(tasks) >= concurrent_requests:
batch_results = await asyncio.gather(*tasks)
for page_num, items, _ in batch_results:
if items:
print(f" ✓ Page {page_num}: {len(items)} collections")
for item in items:
concept_id = item.get("meta", {}).get("concept-id")
if concept_id:
direct_dist_info = item.get("umm", {}).get("DirectDistributionInformation")
if direct_dist_info:
results[concept_id] = direct_dist_info
tasks = []
# Process remaining tasks
if tasks:
batch_results = await asyncio.gather(*tasks)
for page_num, items, _ in batch_results:
if items:
print(f" ✓ Page {page_num}: {len(items)} collections")
for item in items:
concept_id = item.get("meta", {}).get("concept-id")
if concept_id:
direct_dist_info = item.get("umm", {}).get("DirectDistributionInformation")
if direct_dist_info:
results[concept_id] = direct_dist_info
print(f"\n✓ Collected {len(results)} collections with DirectDistributionInformation\n")
return results
def extract_bucket(s3_path: str) -> str:
"""Extract just the bucket name from S3 path."""
if s3_path.startswith('s3://'):
s3_path = s3_path[5:]
bucket = s3_path.split('/')[0]
return bucket
def create_bucket_mapping(results: Dict[str, Dict]) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
"""
Create mapping from bucket to endpoint.
Returns:
Tuple of (bucket_to_endpoint, bucket_conflicts)
where bucket_conflicts contains buckets with multiple endpoints
"""
print("Processing bucket mappings...")
# Track all endpoints seen for each bucket
bucket_endpoints = defaultdict(set)
for concept_id, info in results.items():
endpoint = info.get('S3CredentialsAPIEndpoint')
if not endpoint:
continue
s3_paths = info.get('S3BucketAndObjectPrefixNames', [])
for s3_path in s3_paths:
bucket = extract_bucket(s3_path)
bucket_endpoints[bucket].add(endpoint)
# Create final mapping (using first endpoint alphabetically for conflicts)
bucket_to_endpoint = {}
bucket_conflicts = {}
for bucket, endpoints in bucket_endpoints.items():
if len(endpoints) > 1:
# Conflict detected
bucket_conflicts[bucket] = endpoints
# Use first endpoint alphabetically
bucket_to_endpoint[bucket] = sorted(endpoints)[0]
else:
bucket_to_endpoint[bucket] = next(iter(endpoints))
print(f"✓ Mapped {len(bucket_to_endpoint)} unique buckets to endpoints")
print(f"⚠ Found {len(bucket_conflicts)} bucket(s) with conflicting endpoints\n")
return bucket_to_endpoint, bucket_conflicts
def print_conflicts(conflicts: Dict[str, Set[str]]):
"""Print warning about buckets with multiple endpoints."""
if not conflicts:
print("="*80)
print("✓ NO CONFLICTS - All buckets have consistent endpoints")
print("="*80)
return
print("="*80)
print("⚠ WARNING: BUCKETS WITH MULTIPLE ENDPOINTS")
print("="*80)
print(f"\nFound {len(conflicts)} bucket(s) with conflicting endpoints:\n")
for bucket, endpoints in sorted(conflicts.items()):
print(f"Bucket: {bucket}")
for endpoint in sorted(endpoints):
print(f" - {endpoint}")
print()
def print_summary(mapping: Dict[str, str], conflicts: Dict[str, Set[str]]):
"""Print summary statistics."""
print("="*80)
print("SUMMARY")
print("="*80)
unique_endpoints = len(set(mapping.values()))
print(f"Total unique buckets: {len(mapping)}")
print(f"Unique endpoints: {unique_endpoints}")
print(f"Buckets with conflicts: {len(conflicts)}")
# Group by endpoint
endpoint_groups = defaultdict(list)
for bucket, endpoint in mapping.items():
endpoint_groups[endpoint].append(bucket)
print(f"\nBuckets per endpoint:")
for endpoint, buckets in sorted(endpoint_groups.items(), key=lambda x: len(x[1]), reverse=True):
print(f" {endpoint}")
print(f" → {len(buckets)} bucket(s)")
async def main():
"""Main execution."""
print("\n" + "="*80)
print("NASA CMR ASYNC S3 BUCKET TO ENDPOINT MAPPER")
print("="*80 + "\n")
# Configuration
MAX_PAGES = 10000 # Adjust this to crawl more/fewer pages
PAGE_SIZE = 100 # Max is 2000, but 100 is more stable
CONCURRENT_REQUESTS = 10 # Number of simultaneous requests
# Step 1: Query CMR asynchronously
results = await query_cmr_async(
max_pages=MAX_PAGES,
page_size=PAGE_SIZE,
concurrent_requests=CONCURRENT_REQUESTS
)
if not results:
print("No results collected. Exiting.")
return
# Step 2: Create bucket mapping and detect conflicts
mapping, conflicts = create_bucket_mapping(results)
# Step 3: Print conflicts
print_conflicts(conflicts)
# Step 4: Print summary
print()
print_summary(mapping, conflicts)
# Step 5: Save outputs
print(f"\n{'='*80}")
print("SAVING RESULTS")
print("="*80)
with open('bucket_to_endpoint.json', 'w') as f:
json.dump(mapping, f, indent=2, sort_keys=True)
print("✓ Saved bucket_to_endpoint.json")
if conflicts:
conflicts_serializable = {k: list(v) for k, v in conflicts.items()}
with open('bucket_conflicts.json', 'w') as f:
json.dump(conflicts_serializable, f, indent=2, sort_keys=True)
print("✓ Saved bucket_conflicts.json")
with open('cmr_raw_results.json', 'w') as f:
json.dump(results, f, indent=2)
print("✓ Saved cmr_raw_results.json")
print(f"\n{'='*80}")
print("COMPLETE!")
print("="*80 + "\n")
if __name__ == "__main__":
asyncio.run(main())
What I get as a result is that the mapping is mostly unique:
{
"TestBucket": "www.testexample.com",
"asdc-prod-protected": "https://data.asdc.earthdata.nasa.gov/s3credentials",
"asf-cumulus-prod-alos2-products": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-aria-products": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-browse": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-opera-browse": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-product": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-seasat-products": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-grd-7d1b4348": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-ocn-1e29d408": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-raw-98779950": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-slc-7b420b89": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-xml-8cf7476b": "https://sentinel1.asf.alaska.edu/s3credentials",
"csda-cumulus-prod-protected-5047": "https://data.csdap.earthdata.nasa.gov/s3credentials",
"gesdisc-cumulus-prod-protected": "https://data.gesdisc.earthdata.nasa.gov/s3credentials",
"gesdisc-cumulus-prod-protectedAqua_AIRS_Level2": "https://data.gesdisc.earthdata.nasa.gov/s3credentials",
"ghrcw-protected": "https://data.ghrc.earthdata.nasa.gov/s3credentials",
"ghrcwuat-protected": "https://data.ghrc.uat.earthdata.nasa.gov/s3credentials",
"lp-prod-protected": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-prod-public": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-protected": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-public": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-sit-protected": "https://data.lpdaac.sit.earthdatacloud.nasa.gov/s3credentials",
"lp-sit-public": "https://data.lpdaac.sit.earthdatacloud.nasa.gov/s3credentials",
"nsidc-cumulus-prod-protected": "https://data.nsidc.earthdatacloud.nasa.gov/s3credentials",
"nsidc-cumulus-prod-public": "https://data.nsidc.earthdatacloud.nasa.gov/s3credentials",
"ob-cumulus-prod-public": "https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials",
"ob-cumulus-sit-public": "https://obdaac-tea.sit.earthdatacloud.nasa.gov/s3credentials",
"ob-cumulus-uat-public": "https://obdaac-tea.uat.earthdatacloud.nasa.gov/s3credentials",
"ornl-cumulus-prod-protected": "https://data.ornldaac.earthdata.nasa.gov/s3credentials",
"ornl-cumulus-prod-public": "https://data.ornldaac.earthdata.nasa.gov/s3credentials",
"podaac-ops-cumulus-docs": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"podaac-ops-cumulus-protected": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"podaac-ops-cumulus-public": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"podaac-swot-ops-cumulus-protected": "https://archive.swot.podaac.earthdata.nasa.gov/s3credentials",
"podaac-swot-ops-cumulus-public": "https://archive.swot.podaac.earthdata.nasa.gov/s3credentials",
"prod-lads": "https://data.laadsdaac.earthdatacloud.nasa.gov/s3credentials"
}with a few exceptions: {
"asf-cumulus-prod-opera-browse": [
"https://cumulus.asf.alaska.edu/s3credentials",
"https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials"
],
"asf-cumulus-prod-opera-products": [
"https://cumulus.asf.alaska.edu/s3credentials",
"https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials"
]
}
I'll keep digging to see if the bucket + first 'folder' is in fact unique! |
|
OK, I think this might just be sufficient for now.
Crawl script V1
#!/usr/bin/env python3
"""
Async CMR Query - Map S3 Buckets to Auth Endpoints
Crawls NASA CMR API and maps bucket/prefix keys to S3 credentials endpoints.
Warns about bucket/prefix keys with conflicting endpoints.
Note: In S3, there are no actual "folders" - only keys with delimiters (/).
What looks like a folder structure is just part of the object key.
"""
import asyncio
import aiohttp
import json
from typing import Dict, Set, Tuple
from collections import defaultdict
async def fetch_page(
session: aiohttp.ClientSession,
base_url: str,
page_num: int,
page_size: int,
cloud_hosted: bool = True
) -> Tuple[int, list, int]:
"""
Fetch a single page from CMR API.
Returns:
Tuple of (page_num, items, total_hits)
"""
params = {
"page_size": page_size,
"page_num": page_num
}
if cloud_hosted:
params["cloud_hosted"] = "true"
try:
async with session.get(base_url, params=params, timeout=aiohttp.ClientTimeout(total=30)) as response:
response.raise_for_status()
data = await response.json()
items = data.get("items", [])
total_hits = int(response.headers.get("CMR-Hits", 0))
return page_num, items, total_hits
except Exception as e:
print(f" ✗ Error fetching page {page_num}: {e}")
return page_num, [], 0
async def query_cmr_async(
base_url: str = "https://cmr.earthdata.nasa.gov/search/collections.umm_json",
cloud_hosted: bool = True,
max_pages: int = 100,
page_size: int = 100,
concurrent_requests: int = 10
) -> Dict[str, Dict]:
"""
Asynchronously query CMR API and collect DirectDistributionInformation.
Args:
base_url: CMR API endpoint
cloud_hosted: Filter for cloud-hosted collections
max_pages: Maximum number of pages to fetch
page_size: Results per page
concurrent_requests: Number of concurrent requests
Returns:
Dictionary mapping concept_id to DirectDistributionInformation
"""
print(f"Starting async CMR query...")
print(f" Max pages: {max_pages}")
print(f" Page size: {page_size}")
print(f" Concurrent requests: {concurrent_requests}")
print()
results = {}
async with aiohttp.ClientSession() as session:
# First, fetch page 1 to get total hits
_, items, total_hits = await fetch_page(session, base_url, 1, page_size, cloud_hosted)
if items:
for item in items:
concept_id = item.get("meta", {}).get("concept-id")
if concept_id:
direct_dist_info = item.get("umm", {}).get("DirectDistributionInformation")
if direct_dist_info:
results[concept_id] = direct_dist_info
print(f"Total collections available: {total_hits}")
total_pages = min(max_pages, (total_hits + page_size - 1) // page_size)
print(f"Will fetch {total_pages} page(s)\n")
if total_pages <= 1:
return results
# Fetch remaining pages concurrently
tasks = []
for page_num in range(2, total_pages + 1):
task = fetch_page(session, base_url, page_num, page_size, cloud_hosted)
tasks.append(task)
# Process in batches to limit concurrency
if len(tasks) >= concurrent_requests:
batch_results = await asyncio.gather(*tasks)
for page_num, items, _ in batch_results:
if items:
print(f" ✓ Page {page_num}: {len(items)} collections")
for item in items:
concept_id = item.get("meta", {}).get("concept-id")
if concept_id:
direct_dist_info = item.get("umm", {}).get("DirectDistributionInformation")
if direct_dist_info:
results[concept_id] = direct_dist_info
tasks = []
# Process remaining tasks
if tasks:
batch_results = await asyncio.gather(*tasks)
for page_num, items, _ in batch_results:
if items:
print(f" ✓ Page {page_num}: {len(items)} collections")
for item in items:
concept_id = item.get("meta", {}).get("concept-id")
if concept_id:
direct_dist_info = item.get("umm", {}).get("DirectDistributionInformation")
if direct_dist_info:
results[concept_id] = direct_dist_info
print(f"\n✓ Collected {len(results)} collections with DirectDistributionInformation\n")
return results
def extract_bucket_prefix_key(s3_path: str, prefix_depth: int = 0) -> str:
"""
Extract bucket and prefix up to specified depth from S3 path.
In S3, there are no actual folders - only object keys with '/' delimiters.
This function extracts the bucket and the first N prefix components.
Args:
s3_path: S3 path like 's3://bucket/prefix/component1/component2'
prefix_depth: How many prefix components to include (0 = bucket only)
Returns:
String like 'bucket' (depth=0) or 'bucket/prefix/component1' (depth=2)
Examples:
extract_bucket_prefix_key('s3://my-bucket/data/2024/file.txt', 0) -> 'my-bucket'
extract_bucket_prefix_key('s3://my-bucket/data/2024/file.txt', 1) -> 'my-bucket/data'
extract_bucket_prefix_key('s3://my-bucket/data/2024/file.txt', 2) -> 'my-bucket/data/2024'
"""
if s3_path.startswith('s3://'):
s3_path = s3_path[5:]
parts = s3_path.split('/')
bucket = parts[0]
if prefix_depth == 0:
return bucket
# Include bucket + prefix_depth components
# Handle case where path doesn't have enough components
end_idx = min(1 + prefix_depth, len(parts))
key = '/'.join(parts[:end_idx])
return key
def create_bucket_key_mapping_recursive(
results: Dict[str, Dict],
max_depth: int = 5
) -> Tuple[Dict[str, str], Dict[str, Set[str]]]:
"""
Create mapping from bucket/prefix key to endpoint.
Recursively increases depth only for keys that have conflicts.
Strategy:
1. Start at depth 0 (bucket only)
2. For any key with multiple endpoints, increase depth by 1
3. Repeat until no conflicts or max_depth reached
Args:
results: Dictionary of DirectDistributionInformation
max_depth: Maximum prefix depth to try
Returns:
Tuple of (key_to_endpoint, remaining_conflicts)
"""
print(f"Building bucket/prefix mapping with recursive conflict resolution...")
print(f"Maximum depth: {max_depth}\n")
# First, collect all S3 paths and their endpoints
path_endpoints = [] # List of (s3_path, endpoint)
for concept_id, info in results.items():
endpoint = info.get('S3CredentialsAPIEndpoint')
if not endpoint:
continue
s3_paths = info.get('S3BucketAndObjectPrefixNames', [])
for s3_path in s3_paths:
path_endpoints.append((s3_path, endpoint))
print(f"Total S3 paths to process: {len(path_endpoints)}\n")
# Track final mapping and paths that still need processing
final_mapping = {}
paths_to_process = path_endpoints # Start with all paths at depth 0
for depth in range(max_depth + 1):
if not paths_to_process:
break
print(f"Processing depth {depth}...")
# Build mapping at current depth for paths still being processed
key_endpoints = defaultdict(set)
key_original_paths = defaultdict(list) # Track which original paths map to each key
for s3_path, endpoint in paths_to_process:
key = extract_bucket_prefix_key(s3_path, depth)
key_endpoints[key].add(endpoint)
key_original_paths[key].append((s3_path, endpoint))
# Separate unique keys from conflicting keys
paths_still_conflicting = []
resolved_count = 0
conflict_count = 0
for key, endpoints in key_endpoints.items():
if len(endpoints) == 1:
# No conflict at this depth - add to final mapping
final_mapping[key] = next(iter(endpoints))
resolved_count += 1
else:
# Still conflicting - need to go deeper
if depth < max_depth:
# Add these paths back for processing at next depth
paths_still_conflicting.extend(key_original_paths[key])
conflict_count += 1
else:
# Max depth reached, pick first endpoint alphabetically
final_mapping[key] = sorted(endpoints)[0]
resolved_count += 1
conflict_count += 1
print(f" Resolved: {resolved_count} unique keys")
print(f" Conflicts: {conflict_count} keys")
if depth < max_depth and paths_still_conflicting:
print(f" → Moving {len(paths_still_conflicting)} paths to depth {depth + 1}")
paths_to_process = paths_still_conflicting
# Find any remaining conflicts (shouldn't happen unless max_depth reached)
remaining_conflicts = {}
key_endpoints_final = defaultdict(set)
for key, endpoint in final_mapping.items():
key_endpoints_final[key].add(endpoint)
# Re-scan to find actual conflicts in final mapping
# (can happen if max_depth is reached)
for concept_id, info in results.items():
endpoint = info.get('S3CredentialsAPIEndpoint')
if not endpoint:
continue
s3_paths = info.get('S3BucketAndObjectPrefixNames', [])
for s3_path in s3_paths:
# Find which key this path mapped to
for depth in range(max_depth + 1):
key = extract_bucket_prefix_key(s3_path, depth)
if key in final_mapping:
if final_mapping[key] != endpoint:
if key not in remaining_conflicts:
remaining_conflicts[key] = set()
remaining_conflicts[key].add(endpoint)
remaining_conflicts[key].add(final_mapping[key])
break
print(f"\n✓ Final mapping has {len(final_mapping)} unique bucket/prefix keys")
print(f"⚠ Unresolved conflicts: {len(remaining_conflicts)} keys\n")
return final_mapping, remaining_conflicts
def print_conflicts(conflicts: Dict[str, Set[str]]):
"""Print warning about bucket/prefix keys with multiple endpoints."""
if not conflicts:
print("="*80)
print("✓ NO CONFLICTS - All bucket/prefix keys have unique endpoints")
print("="*80)
return
print("="*80)
print("⚠ WARNING: BUCKET/PREFIX KEYS WITH MULTIPLE ENDPOINTS")
print("="*80)
print(f"\nFound {len(conflicts)} key(s) with unresolved conflicts:\n")
print("(These conflicts could not be resolved even at maximum depth)\n")
for key, endpoints in sorted(conflicts.items()):
depth = key.count('/')
print(f"Key: {key} (depth={depth})")
for endpoint in sorted(endpoints):
print(f" - {endpoint}")
print()
def print_summary(mapping: Dict[str, str], conflicts: Dict[str, Set[str]]):
"""Print summary statistics."""
print("="*80)
print("SUMMARY")
print("="*80)
unique_endpoints = len(set(mapping.values()))
print(f"Total unique bucket/prefix keys: {len(mapping)}")
print(f"Unique endpoints: {unique_endpoints}")
print(f"Unresolved conflicts: {len(conflicts)}")
# Show depth distribution
depth_counts = defaultdict(int)
for key in mapping.keys():
depth = key.count('/')
depth_counts[depth] += 1
print(f"\nDepth distribution:")
for depth in sorted(depth_counts.keys()):
print(f" Depth {depth}: {depth_counts[depth]} keys")
# Group by endpoint
endpoint_groups = defaultdict(list)
for key, endpoint in mapping.items():
endpoint_groups[endpoint].append(key)
print(f"\nKeys per endpoint:")
for endpoint, keys in sorted(endpoint_groups.items(), key=lambda x: len(x[1]), reverse=True):
print(f" {endpoint}")
print(f" → {len(keys)} key(s)")
# Show a few examples with their depths
if len(keys) <= 3:
for key in sorted(keys):
depth = key.count('/')
print(f" - {key} (depth={depth})")
else:
for key in sorted(keys)[:3]:
depth = key.count('/')
print(f" - {key} (depth={depth})")
print(f" ... and {len(keys) - 3} more")
async def main():
"""Main execution."""
print("\n" + "="*80)
print("NASA CMR ASYNC S3 BUCKET/PREFIX TO ENDPOINT MAPPER")
print("(Recursive Conflict Resolution)")
print("="*80 + "\n")
# Configuration
MAX_PAGES = 1000 # Adjust this to crawl more/fewer pages
PAGE_SIZE = 100 # Max is 2000, but 100 is more stable
CONCURRENT_REQUESTS = 10 # Number of simultaneous requests
MAX_DEPTH = 5 # Maximum prefix depth to try for conflict resolution
print(f"Configuration:")
print(f" Max depth for conflict resolution: {MAX_DEPTH}")
print(f" Strategy: Start at depth 0, increase depth only for conflicts")
print()
# Step 1: Query CMR asynchronously
results = await query_cmr_async(
max_pages=MAX_PAGES,
page_size=PAGE_SIZE,
concurrent_requests=CONCURRENT_REQUESTS
)
if not results:
print("No results collected. Exiting.")
return
# Step 2: Create bucket/prefix mapping with recursive conflict resolution
mapping, conflicts = create_bucket_key_mapping_recursive(results, max_depth=MAX_DEPTH)
# Step 3: Print conflicts
print_conflicts(conflicts)
# Step 4: Print summary
print()
print_summary(mapping, conflicts)
# Step 5: Save outputs
print(f"\n{'='*80}")
print("SAVING RESULTS")
print("="*80)
with open('bucket_to_endpoint.json', 'w') as f:
json.dump(mapping, f, indent=2, sort_keys=True)
print("✓ Saved bucket_to_endpoint.json")
if conflicts:
conflicts_serializable = {k: list(v) for k, v in conflicts.items()}
with open('bucket_conflicts.json', 'w') as f:
json.dump(conflicts_serializable, f, indent=2, sort_keys=True)
print("✓ Saved bucket_conflicts.json")
with open('cmr_raw_results.json', 'w') as f:
json.dump(results, f, indent=2)
print("✓ Saved cmr_raw_results.json")
print(f"\n{'='*80}")
print("COMPLETE!")
print("="*80 + "\n")
if __name__ == "__main__":
asyncio.run(main())
gives:
{
"TestBucket": "www.testexample.com",
"asdc-prod-protected": "https://data.asdc.earthdata.nasa.gov/s3credentials",
"asf-cumulus-prod-alos2-products": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-aria-products": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-browse": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-opera-browse/OPERA_L2_CSLC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-browse/OPERA_L2_RTC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-browse/OPERA_L4_TROPO-ZENITH_V1": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-opera-product": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_CSLC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_CSLC-S1_STATIC": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_RTC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_RTC-S1_STATIC": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L4_TROPO-ZENITH_V1": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-seasat-products": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-grd-7d1b4348": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-ocn-1e29d408": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-raw-98779950": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-slc-7b420b89": "https://sentinel1.asf.alaska.edu/s3credentials",
"asf-ngap2w-p-s1-xml-8cf7476b": "https://sentinel1.asf.alaska.edu/s3credentials",
"csda-cumulus-prod-protected-5047": "https://data.csdap.earthdata.nasa.gov/s3credentials",
"gesdisc-cumulus-prod-protected": "https://data.gesdisc.earthdata.nasa.gov/s3credentials",
"gesdisc-cumulus-prod-protectedAqua_AIRS_Level2": "https://data.gesdisc.earthdata.nasa.gov/s3credentials",
"ghrcw-protected": "https://data.ghrc.earthdata.nasa.gov/s3credentials",
"ghrcwuat-protected": "https://data.ghrc.uat.earthdata.nasa.gov/s3credentials",
"lp-prod-protected": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-prod-public": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-protected": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-public": "https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials",
"lp-sit-protected": "https://data.lpdaac.sit.earthdatacloud.nasa.gov/s3credentials",
"lp-sit-public": "https://data.lpdaac.sit.earthdatacloud.nasa.gov/s3credentials",
"nsidc-cumulus-prod-protected": "https://data.nsidc.earthdatacloud.nasa.gov/s3credentials",
"nsidc-cumulus-prod-public": "https://data.nsidc.earthdatacloud.nasa.gov/s3credentials",
"ob-cumulus-prod-public": "https://obdaac-tea.earthdatacloud.nasa.gov/s3credentials",
"ob-cumulus-sit-public": "https://obdaac-tea.sit.earthdatacloud.nasa.gov/s3credentials",
"ob-cumulus-uat-public": "https://obdaac-tea.uat.earthdatacloud.nasa.gov/s3credentials",
"ornl-cumulus-prod-protected": "https://data.ornldaac.earthdata.nasa.gov/s3credentials",
"ornl-cumulus-prod-public": "https://data.ornldaac.earthdata.nasa.gov/s3credentials",
"podaac-ops-cumulus-docs": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"podaac-ops-cumulus-protected": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"podaac-ops-cumulus-public": "https://archive.podaac.earthdata.nasa.gov/s3credentials",
"podaac-swot-ops-cumulus-protected": "https://archive.swot.podaac.earthdata.nasa.gov/s3credentials",
"podaac-swot-ops-cumulus-public": "https://archive.swot.podaac.earthdata.nasa.gov/s3credentials",
"prod-lads": "https://data.laadsdaac.earthdatacloud.nasa.gov/s3credentials"
}
I am inclined to just commit this mapping to the repo and add the script so we can update it quickly if things change. Or is this a really crappy idea? |
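If the mapping does get committed, the runtime lookup could try the most specific key first (bucket plus deepest prefix) and fall back to the bare bucket; a rough sketch of that idea, where the file name and helper are assumptions rather than part of this PR:

```python
import json


def find_credentials_endpoint(s3_url: str, mapping_path: str = "bucket_to_endpoint.json") -> str | None:
    """Hypothetical longest-prefix lookup against the committed bucket/prefix mapping."""
    with open(mapping_path) as f:
        mapping: dict[str, str] = json.load(f)

    # 's3://bucket/a/b/file.nc' -> ['bucket', 'a', 'b', 'file.nc']
    parts = s3_url.removeprefix("s3://").split("/")

    # Try the most specific key first, then progressively shorter prefixes.
    for depth in range(len(parts), 0, -1):
        key = "/".join(parts[:depth])
        if key in mapping:
            return mapping[key]
    return None
```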
|
This probably makes the most sense for now.
These both should work identically, but it'd be worth prioritizing the |
|
Also, these ones: {
"asf-cumulus-prod-opera-browse/OPERA_L2_CSLC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-browse/OPERA_L2_RTC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-browse/OPERA_L4_TROPO-ZENITH_V1": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_CSLC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_CSLC-S1_STATIC": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_RTC-S1": "https://cumulus.asf.alaska.edu/s3credentials",
"asf-cumulus-prod-opera-products/OPERA_L2_RTC-S1_STATIC": "https://cumulus.asf.alaska.edu/s3credentials",
}
All have a prefix included with the bucket name and could just be:
{
"asf-cumulus-prod-opera-browse": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials",
"asf-cumulus-prod-opera-products": "https://cumulus.asf.earthdatacloud.nasa.gov/s3credentials"
} |
Ah, that is super helpful. @jhkennedy, will you be at the hack on Tuesday by any chance? |
|
Another question for the DAAC folks here: is there any way to add a dummy icechunk store to any of the EDL-authenticated buckets? I think that might be the easiest way to test the top-level functionality here. I am revising the structure of the code quite a bit at the moment. My plan is to support two main use cases:
1. "Full EDL" case: the icechunk store and any virtual chunks (if present) are within EDL buckets
from earthaccess.icechunk import open_icechunk_from_url
url = 's3://some-edl-bucket/pointing/to/ic/store'  # how to get this url will be solved by different logic
store = open_icechunk_from_url(url)
Simple as that, but at this point this is a non-existent use case AFAICT?
2. "Virtual EDL Chunks" case: the icechunk store is wherever, but all the virtual chunks point to one or more EDL buckets:
import icechunk as ic
from earthaccess.icechunk import get_virtual_chunk_credentials
storage = ... # configure your custom icechunk storage
vchunk_credentials = get_virtual_chunk_credentials(storage)
repo = ic.Repository.open(storage=storage, authorize_virtual_chunk_access=vchunk_credentials)
...
This is not quite as automatic, but it will actually help a lot of current use cases, I think. It is also quite a lot shorter than what I have to do here, for example. I think this would even enable more 'frankenstein-ish' cases, where an icechunk repo points to some EDL and some non-EDL buckets (to be tested). |
|
@betolink, could you assign me to this PR so we can track it more easily on the DevSeed side? |
|
@jbusecke, looks like integration tests are failing due to h5netcdf engine not being installed. See https://github.com/nsidc/earthaccess/actions/runs/20077100690/job/57595175133?pr=1135#step:8:768 |
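For what it's worth, a quick check like the one below (assuming xarray is already a test dependency) makes it obvious in the test output when the backend is missing; it is only a debugging aid, not part of the PR:

```python
import xarray as xr

# list_engines() returns the backends xarray can actually use in this environment;
# if "h5netcdf" is absent, the package needs to be added to the test dependencies.
available = xr.backends.list_engines()
assert "h5netcdf" in available, f"h5netcdf backend missing; available engines: {sorted(available)}"
```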
|
@jbusecke, integration tests are still failing due to missing dependency: https://github.com/nsidc/earthaccess/actions/runs/20254367444/job/58196492650?pr=1135#step:8:1079 |
|
OK, I think this is getting into shape. I have tested the top-level functions locally today (basically all the tests in
Disclaimer: This is all currently still based on this hardcoded mapping, which needs to be refactored after #1154 is fixed. So this should not be merged, but I think the functionality can be reviewed nonetheless. I'll post a few questions I had on the code directly after this post.
I did run the integration tests locally on the VEDA hub and got these (seemingly unrelated?) errors. Could somebody advise whether I interpreted this correctly, and maybe also enable me to run the integration tests here? @jhkennedy @betolink @chuckwondo, would any of you have a bit of time to take a look at this?
Details
================================================================================== ERRORS =================================================================================== |
raise ValueError(
    "A valid Earthdata login instance is required to retrieve credentials for icechunk stores"
)
I adopted this from earthaccess.get_s3_credentials. Please let me know if you think I should change the wording.
| """ | ||
| # get config and extract virtual containers | ||
| config = ic.Repository.fetch_config(storage=storage) | ||
| # TODO: accommodate case without virtual chunk containers. |
Suggested change: remove `# TODO: accommodate case without virtual chunk containers.`
This currently works but emits a warning. We can discuss these details further.
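To make that concrete, the no-container path could look roughly like this; a sketch only, where `virtual_chunk_containers` as the config attribute is my assumption about the icechunk API rather than code from this PR:

```python
import warnings

import icechunk as ic


def _virtual_chunk_containers_or_warn(storage) -> dict:
    """Return the repo's virtual chunk containers, warning when there are none."""
    config = ic.Repository.fetch_config(storage=storage)
    containers = getattr(config, "virtual_chunk_containers", None) or {}
    if not containers:
        warnings.warn(
            "No virtual chunk containers found in the icechunk config; "
            "no virtual chunk credentials will be generated."
        )
    return dict(containers)
```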
# try to build authentication for all virtual chunk containers. If any of the virtual
# chunk containers is not 'approved' it will raise an error in `_get_credential_endpoint`.
# We will catch the error here, warn, and only return the authenticated urls.
# Users will then get an error for the remaining containers and need to add those manually!
Suggested change:
- # Users will then get an error for the remaining containers and need to add those manually!
+ # Users will then get a warning for the remaining containers and need to add those manually!
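The catch-and-warn behavior described in that comment could be sketched like this; the credential-endpoint callable stands in for the PR's `_get_credential_endpoint`, and treating the containers as a name-to-URL-prefix mapping is a simplification:

```python
import warnings
from typing import Callable


def credentials_for_containers(
    containers: dict[str, str],
    get_credential_endpoint: Callable[[str], str],
) -> dict[str, str]:
    """Map container name -> credentials endpoint, warning about and skipping failures."""
    credential_mapping: dict[str, str] = {}
    for name, url_prefix in containers.items():
        try:
            credential_mapping[name] = get_credential_endpoint(url_prefix)
        except ValueError as err:
            warnings.warn(
                f"Could not resolve credentials for container {name!r}: {err}. "
                "Credentials for this container need to be added manually."
            )
    return credential_mapping
```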
If the URL is a non EDL bucket, you have to manually construct credentials (...)"
)

# TODO: Check how easy it is to 'splice' this output with manually created credentials
I will address this in the docs.
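For reference, 'splicing' could be as simple as merging the earthaccess-derived mapping with manually constructed entries before opening the repository; a sketch under the assumption that `get_virtual_chunk_credentials` returns a plain mapping of container name to credentials (the manual entry below is a placeholder):

```python
import icechunk as ic

from earthaccess.icechunk import get_virtual_chunk_credentials  # name from this PR, may still change

storage = ...  # your icechunk storage, configured elsewhere
edl_credentials = get_virtual_chunk_credentials(storage)

# Hypothetical manually constructed credentials for a non-EDL container.
manual_credentials = {"other-container": ...}  # whatever icechunk credentials object fits that bucket

# Splice the two: manual entries win where container names overlap.
combined = {**edl_credentials, **manual_credentials}
repo = ic.Repository.open(storage=storage, authorize_virtual_chunk_access=combined)
```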
return ic.containers_credentials(credential_mapping)


# TODO: Review datacube vocab? Do we want to use this? What is a good general term for zarr-ish data?
More general question (tagging in @DeanHenze): what vocabulary should we use here and throughout ea?
This does not influence functionality at all; it is just a matter of consistency. The current vocab has granule and dataset?
- virtual-zarr (not all stores have to be virtual, so this might not be general enough?)
- virtual dataset (we could have e.g. datatrees too)
- datacube (my current favorite, but happy to change this here)
Curious to hear what others think.
.
"""
# currently only supports s3
# How would this support e.g. http, which other protocols make sense?
Would love to discuss this, but I think this is better served in a separate PR?
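If it helps that later discussion, restricting to s3 for now could be an explicit scheme check that fails loudly for anything else (sketch only, not the PR's code):

```python
from urllib.parse import urlparse


def validate_icechunk_url(url: str) -> str:
    """Accept only s3:// URLs for now; other protocols could be added in a follow-up PR."""
    scheme = urlparse(url).scheme
    if scheme != "s3":
        raise NotImplementedError(
            f"Protocol {scheme!r} is not supported yet; only 's3://' URLs are handled."
        )
    return url
```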
)

# return readonly store from main
# TODO: should this be configurable?
Maybe we want to give users at least the tag/branch id as a configurable input parameter?
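Exposing that could look roughly like the following; a sketch where the `readonly_session` call reflects my understanding of the icechunk API (not code from this PR), with `main` as the default per the comment above:

```python
import icechunk as ic


def open_readonly_store(repo: ic.Repository, *, branch: str = "main", tag: str | None = None):
    """Return a read-only store, from a tag if given, otherwise from a branch (default: main)."""
    session = repo.readonly_session(tag=tag) if tag is not None else repo.readonly_session(branch=branch)
    return session.store
```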
# TODO: Is this the way to do it?
earthaccess.login()
This is one of the actual technical questions remaining: Is this the way to ensure a login within the integration tests? It worked locally (when I set up env variables), but I was not sure if there is another pattern that you use for testing specifically.
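One common pattern, sketched below, is to perform the login once in a session-scoped pytest fixture and rely on the environment strategy (this assumes EARTHDATA_USERNAME/EARTHDATA_PASSWORD are set in CI; it is a suggestion, not necessarily the repo's existing convention):

```python
import earthaccess
import pytest


@pytest.fixture(scope="session", autouse=True)
def edl_auth():
    """Log in once per test session using credentials from environment variables."""
    auth = earthaccess.login(strategy="environment")
    if not auth.authenticated:
        pytest.skip("No Earthdata credentials available; skipping icechunk integration tests")
    return auth
```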
|
@betolink can correct me if I'm wrong, but is ea already refactored for zarr v3? For the use case where the icechunk store is outside of a DAAC, I'm not sure why this requires alternate code. For kerchunk JSONs, I've created and stored a JSON locally and used it to access PO.DAAC data, so is icechunk different, or what am I missing?
I need to catch up: vocab within the ea package, or user-facing vocab? "granule" and "collection" are the two terms to use internally to be consistent with NASA Earthdata terms. Is there a user-facing front-end case you're thinking of?
Could I get more context here as well? Is this internal ea package vocab we're referring to? |
This is a first start towards building an icechunk opener for earthaccess (see #1132 for context).
This PR depends on #1154
This is still very rough and might change a lot.
This implements a very minimalist test and opener function in earthaccess.icechunk._open_icechunk_from_url. There are a ton of todos and questions, but let me try to point out the most pressing ones.
Pull Request (PR) draft checklist - click to expand
contributing documentation
before getting started.
title such as "Add testing details to the contributor section of the README".
Example PRs: #763
example
closes #1. See GitHub docs - Linking a pull request to an issue.
CHANGELOG.md with details about your change in a section titled ## Unreleased. If such a section does not exist, please create one. Follow Common Changelog for your additions.
Example PRs: #763
README.md with details of changes to the earthaccess interface, if any. Consider new environment variables, function names,
decorators, etc.
Click the "Ready for review" button at the bottom of the "Conversation" tab in GitHub
once these requirements are fulfilled. Don't worry if you see any test failures in
GitHub at this point!
Pull Request (PR) merge checklist - click to expand
Please do your best to complete these requirements! If you need help with any of these
requirements, you can ping the
@nsidc/earthaccess-support team in a comment and we will help you out!
Request containing "pre-commit.ci autofix" to automate this.
📚 Documentation preview 📚: https://earthaccess--1135.org.readthedocs.build/en/1135/