
Improve documentation on how to handle binary and non-binary files (local/remote, up-/download) #595


Closed
3 tasks done
do-me opened this issue Jul 12, 2022 · 13 comments
Labels: backlog (issue has been triaged but has not been earmarked for any upcoming release); tag: docs (related to the documentation)

Comments

@do-me

do-me commented Jul 12, 2022

Checklist

  • I added a descriptive title
  • I searched for other issues and couldn't find a duplication
  • I already searched in Google and didn't find any good information or help

What is the issue/comment/problem?

There are a few issues around here concerned with file handling (#588, #558, #463, #151 amongst others).
It would be nice to have a dedicated section in the docs with the recommended way of doing things for binary and non-binary files.
Summed up:

Local

  • Load local file to browser (covered here or here)
  • Download file from browser to local (two examples here with file picker, but non-binary data only)

Remote

Because binary files (e.g. Excel or, more generally, zip files) behave differently from non-binary ones, it would be very useful to document the distinction explicitly; otherwise one stumbles across missing awaits or similar problems.
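To make the distinction concrete, here is a minimal sketch using only the standard library. It assumes nothing about PyScript itself: the helper names `read_text_payload` and `list_zip_members` are hypothetical, and the payloads are built in memory instead of being fetched. The only point is that text payloads get decoded to `str`, while a zip-based file (an .xlsx file is a zip archive) has to stay `bytes` and be wrapped in `BytesIO`.

```python
import io
import zipfile


def read_text_payload(raw: bytes, encoding: str = "utf-8") -> str:
    """Non-binary data: decode the bytes into a str before parsing."""
    return raw.decode(encoding)


def list_zip_members(raw: bytes) -> list[str]:
    """Binary data: keep the bytes and wrap them in BytesIO for file-like access."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return archive.namelist()


# Build a tiny zip archive in memory to stand in for a binary download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hi")
zip_bytes = buffer.getvalue()

# A CSV payload as it would arrive over the wire (bytes, not str).
csv_bytes = "a,b\n1,2\n".encode("utf-8")

print(read_text_payload(csv_bytes))
print(list_zip_members(zip_bytes))
```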

I think most of the above points are already described somewhere but I'm missing an example of how to conveniently access the virtual file system in order to download something locally.

Let's consider this:

from pyodide.http import pyfetch
import asyncio
import pandas as pd 
import openpyxl
from io import BytesIO

response = await pyfetch(url="/downloads/test.xlsx", method="GET")
bytes_response = await response.bytes()
df = pd.read_excel(BytesIO(bytes_response))
df

That's the (currently) easiest way of loading binary files. If I call df.to_excel("test_output.xlsx") and df.to_csv("test_output.csv") pandas will save the output to the virtual file system.

What's the best way of automatically starting the download from the browser to local when pandas is done saving to the virtual file system or could this even be skipped in some way? Do we need to use some js proxy, js buffer for the hooks or would you simply use some pyodide function for this?

@do-me do-me added the needs-triage Issue needs triage label Jul 12, 2022
@marimeireles marimeireles added tag: docs Related to the documentation and removed needs-triage Issue needs triage labels Jul 20, 2022
@marimeireles
Contributor

I'm not sure if this issue was discussed last week when I wasn't around.
But maybe @antocuni has an opinion on this? Or should I ping Fabio here?
Thanks!
And thanks @do-me for opening the issue =)

@antocuni
Contributor

What's the best way of automatically starting the download from the browser to local when pandas is done saving to the virtual file system or could this even be skipped in some way?

AFAIK we don't have a PyScript-specific way to do it, so currently the best way is to use Pyodide directly. This Stack Overflow answer shows a possible solution:
https://stackoverflow.com/questions/64669355/how-to-copy-download-file-created-in-pyodide-in-browser

Yes, we should provide a more straightforward way of doing it.
Yes, we should definitely improve the docs :).

@do-me
Author

do-me commented Jul 22, 2022

Thanks, later I'll look into it. Meanwhile I might have found a different cross-browser solution for downloading blobs, will test later and update here. Once I get this running, I'll document everything and set up minimal examples for every variant.

@do-me
Author

do-me commented Jul 22, 2022

Just sharing some WIP in case anyone needs it ASAP: loading a remote Excel file (.xlsx), reading it into a pandas DataFrame and downloading it as .csv with the file picker solution.

Requires a download HTML button on the page, e.g. <button id="download">Download</button>

from pyodide.http import pyfetch
import asyncio
import pandas as pd 
import openpyxl
from io import BytesIO
import sys
from js import alert, console, document, Object, window  # console is used in file_save
from pyodide import create_proxy, to_js

async def load_df():
  response = await pyfetch(url="/downloads/test.xlsx", method="GET")
  bytes_response = await response.bytes()
  df = pd.read_excel(BytesIO(bytes_response))
  content = df.to_csv() # returns string when file name missing 
  return content
  
async def file_save(event):
	try:
		options = {
			"startIn": "downloads",
			"suggestedName": "test_123456.csv"
		}

		fileHandle = await window.showSaveFilePicker(Object.fromEntries(to_js(options)))
	except Exception as e:
		console.log('Exception: ' + str(e))
		return

	content = await load_df()

	file = await fileHandle.createWritable()
	await file.write(content)
	await file.close()
	return

def setup_button():
	# Create a Python proxy for the callback function
	file_save_proxy = create_proxy(file_save)

	# Set the listener to the callback
	document.getElementById("download").addEventListener("click", file_save_proxy, False)

setup_button()

I'm working on
a) cross-browser functionality as file picker isn't working in Firefox and
b) blob (= e.g. xlsx files) downloads.
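One possible way to approach (a) is feature detection: check whether the browser exposes `showSaveFilePicker` and fall back to another download method otherwise. The sketch below is an assumption, not a tested PyScript recipe. `pick_download_strategy` and the strategy names are hypothetical, and the `window` argument would be `js.window` in the browser; plain stand-in objects are used here so the snippet runs anywhere. It also assumes that attribute lookup on the JavaScript proxy raises `AttributeError` for missing properties, so that `hasattr` behaves as expected.

```python
from types import SimpleNamespace


def pick_download_strategy(window) -> str:
    """Return "file_picker" when showSaveFilePicker is exposed, else "anchor"."""
    if hasattr(window, "showSaveFilePicker"):
        return "file_picker"
    return "anchor"


# Stand-ins for js.window: one with the picker, one without (as in Firefox).
with_picker = SimpleNamespace(showSaveFilePicker=lambda options: None)
without_picker = SimpleNamespace()

print(pick_download_strategy(with_picker))
print(pick_download_strategy(without_picker))
```

The "anchor" branch would then use a temporary download link instead of the file picker.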

@hellozeyu

Not sure if this is super related, but we created a WordPress plugin around PyScript, and most of the examples work on the site. However, it always throws this error when we try to read a remote CSV file in pandas or even just read a remote URL. Is this related to encoding? The weird thing is that the other examples, including the matplotlib one, work.

image

@do-me
Author

do-me commented Jul 27, 2022

Hi @hellozeyu this is not related as your URL is simply wrong. You're trying to read a csv from the GitHub landing page https://github.com/. Insert the real link (raw csv file, not the repo) and it should work. E.g. this one.

@do-me
Author

do-me commented Jul 27, 2022

Ah sorry, thought this was pandas-related. You cannot work with the urllib or requests package in pyscript but need to use the pyodide alternatives. See this example.
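As a rough sketch of that pattern: the helper below takes the fetch coroutine as a parameter, so in PyScript one would pass `pyodide.http.pyfetch` and call it with `await load_csv_rows(url, pyfetch)`. The names `load_csv_rows`, `FakeResponse` and `fake_fetch`, and the URL, are hypothetical; the fake objects only imitate the `string()` method of the fetch response so the snippet can run outside the browser.

```python
import asyncio
import csv
import io


async def load_csv_rows(url: str, fetch) -> list[dict]:
    """Fetch `url` with the given coroutine and parse the body as CSV."""
    response = await fetch(url)
    text = await response.string()  # body as str (non-binary data)
    return list(csv.DictReader(io.StringIO(text)))


class FakeResponse:
    """Hypothetical stand-in for the part of the fetch response used above."""

    def __init__(self, text: str) -> None:
        self._text = text

    async def string(self) -> str:
        return self._text


async def fake_fetch(url: str) -> FakeResponse:
    return FakeResponse("a,b\n1,2\n")


# Outside the browser we need asyncio.run; in PyScript, use top-level await instead.
rows = asyncio.run(load_csv_rows("/downloads/test.csv", fake_fetch))
print(rows)
```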

@hellozeyu

Got it. It works for me. Thanks!

@marimeireles
Contributor

@do-me
Do you think the solution linked in this issue fits your use case? #756
Also, I think you already found a solution? Not sure, but the last time we talked you said you had something almost working.
Lemme know if you need help, we can sync =)
I think it'd be really cool to have docs on it, and we can do it in a "how to" style. Basically, just some code snippets that work and a short explanation of why it works the way it does would be perfect.
Jeff Glass contributed something like this last week: https://docs.pyscript.net/latest/howtos/passing-objects.html
The one about output could be much shorter though.

@do-me
Author

do-me commented Sep 12, 2022

Hi @marimeireles!
Thanks for coming back to this issue.
I'm not at home this week but I'll have a look at the new docs next week. Looks promising!

I might have found a way for the last missing piece (binary downloads like excel) via octet streams and DOM manipulation. I didn't find the time yet to test properly, but as soon as I succeed, I'll come back here! So technically the issue is not yet 100% solved I'd say.

Great idea for the snippet-style docs - I think that really suits the spirit of pyscript!

@marimeireles
Contributor

Alright! :)
I'm around, just ping me.

@do-me
Author

do-me commented Oct 3, 2022

I finally found the time for testing binary downloads from the virtual file system. I wrote a simple function that takes care of everything and saves a pandas excel export to the local file system:

from pyodide.http import pyfetch
import asyncio
import pandas as pd 
import openpyxl
from io import BytesIO
import base64
from js import document

def pandas_excel_export(df, filename):
    # save to virtual filesystem
    df.to_excel(filename + ".xlsx")

    # binary xlsx to base64 encoded downloadable string 
    # read back the file that was just written (not a hard-coded name)
    with open(filename + ".xlsx", 'rb') as f:
        data = f.read()
    base64_encoded = base64.b64encode(data).decode('UTF-8')
    octet_string = "data:application/octet-stream;base64,"
    download_string = octet_string + base64_encoded

    # create new helper DOM element, click (download) and remove 
    element = document.createElement('a')
    element.setAttribute("href",download_string)
    element.setAttribute("download",filename + ".xlsx")
    element.click()
    element.remove()

# import 
response = await pyfetch("/downloads/test.xlsx", method="GET")
bytes_response = await response.bytes()

# read from bytes
df = pd.read_excel(BytesIO(bytes_response))

# manipulate
df["d"] = df["a"] + df["b"]

# export
pandas_excel_export(df,"test")

Working example here.
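The base64 step in `pandas_excel_export` can also be isolated into a small pure-Python helper, which makes it easy to check outside the browser. This is only a sketch: `to_data_url` and `XLSX_MIME` are hypothetical names, and passing the .xlsx media type instead of `application/octet-stream` is an optional variation, not something the function above requires.

```python
import base64

# Registered media type for .xlsx files; octet-stream works as a generic fallback.
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def to_data_url(data: bytes, mime_type: str = "application/octet-stream") -> str:
    """Encode raw bytes as a base64 data URL usable as the href of a download link."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Four bytes stand in for real file content here.
url = to_data_url(b"PK\x03\x04", XLSX_MIME)
print(url)
```

In the browser, the returned string would be passed to `element.setAttribute("href", ...)` exactly as `download_string` is above.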

Coming back to the original purpose of this issue, I think we have everything we need to improve the documentation!

What do you think about a dedicated `File Handling` section in the docs under Getting Started? Or do you think it belongs in the How-to section instead?

I am preparing a dedicated blog post in the spirit of the original issue description (local/remote & import/export & non-binary/binary data) that could serve as a base for further discussion.

This was referenced Oct 4, 2022
@marimeireles marimeireles added the backlog issue has been triaged but has not been earmarked for any upcoming release label Oct 4, 2022
@tedpatrick tedpatrick moved this to Next in PyScript OSS Apr 3, 2023
@WebReflection
Contributor

I am closing this for the following reasons:

  • we now offer a fetch(...).bytearray() to solve the conversion issue
  • we have documented how to write, read, upload, download files via latest PyScript
  • binary vs. non-binary is still a matter of open(..., 'rb') vs. open(..., 'r'), so I hope we covered it all
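The last point can be illustrated with a short, standard-library-only sketch. It uses a temporary directory instead of the PyScript virtual file system, and the file name is hypothetical; the same `open` modes are assumed to apply in both places.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")

    # Text mode: write str; newline="" keeps "\n" untranslated on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("name,value\n\u00e4,1\n")

    # 'r' returns str, decoded with the given encoding.
    with open(path, "r", encoding="utf-8", newline="") as f:
        as_text = f.read()

    # 'rb' returns the raw bytes, with no decoding at all.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```

Files that are not text at all, such as .xlsx, should only ever be opened with 'rb' or 'wb'.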

Projects
Status: Next
Development

No branches or pull requests

5 participants