Re: [DISCUSSION] Extending TableData API


Re: [DISCUSSION] Extending TableData API

Andrea Santurbano
Hi guys,
this is great! I think this can also enable some drop-down feature between
tables in the UI...
Do you think these enhancements can also include the graph part?

Andrea

Il giorno lun 12 giu 2017 alle ore 05:47 Jun Kim <[hidden email]> ha
scritto:

> All of the enhancements look great to me!
>
> And I wish for a feature to upload a small CSV file (maybe about
> 20MB..?) and play with it directly.
> It would be great if I could drag a file to Zeppelin and register it as a
> table.
>
> Thanks :)
>
> 2017년 6월 12일 (월) 오전 11:40, Park Hoon <[hidden email]>님이 작성:
>
>> Hi All,
>>
>> Recently, ZEPPELIN-753
>> <https://issues.apache.org/jira/browse/ZEPPELIN-753> (Tabledata
>> abstraction) and ZEPPELIN-2020
>> <https://issues.apache.org/jira/browse/ZEPPELIN-2020> (Remote method
>> invocation for resources) were resolved.
>> Based on this work, we can improve Zeppelin with the following
>> enhancements:
>>
>> * register a table result as a shared resource
>> * list all available (registered) tables
>> * preview tables, including their meta information (e.g. columns, types, ..)
>> * download registered tables as CSV and other formats
>> * pivot/filter in the backend to transform larger data
>> * cross-join tables across different interpreters (e.g. the Spark
>> interpreter uses a table result generated by the JDBC interpreter)
>>
>> You can find the full proposal in Extending Table Data API
>> <https://cwiki.apache.org/confluence/display/ZEPPELIN/Proposal%3A+Extending+TableData+API>,
>> contributed by @1ambda, @khalidhuseynov, @Leemoonsoo.
>>
>> Any questions, feedback, or discussion are welcome.
>>
>>
>> Thanks.
>>
> --
> Taejun Kim
>
> Data Mining Lab.
> School of Electrical and Computer Engineering
> University of Seoul
>

Re: [DISCUSSION] Extending TableData API

Khalid Huseynov-3
Thanks for the questions, guys!

@Jun Kim, that feature was actually discussed originally and put into the
backlog, since the proposal was more about tables processed by interpreters
and sharing them. However, quick on-the-fly visualization of not-so-large
data does make sense, and could possibly be done by importing the data into
some interpreter by default (Spark, Python, etc.). So I believe it can be
done once the initial basics of resource sharing are completed.

@Andrea Santurbano, there should be a listing of tables with schema info,
but I'm not sure exactly what you mean by a drop-down feature between tables
in the UI. Could you give a little more detail or an example of that, as
well as of the graph enhancements you meant?



Re: [DISCUSSION] Extending TableData API

Jeff Zhang

Hi Park, 

Thanks for sharing; this is a very interesting and innovative idea. I have several comments and concerns.

1. What does resource registration mean?
   IIUC, currently it means the data is cached in the interpreter process. Then it might become a memory issue as more and more resources are registered. Maybe we could introduce a resource retention mechanism, or cache the data in other forms (like the Spark table cache policy, where the user can specify how to cache the data: memory, disk, etc.).

2. The scope of resource sharing
   For now, it seems resources are globally shared. But I think user-level sharing might be more common. Then we need to create a namespace for each user, which means the same resource name could exist in different user namespaces.
 
3. The data route might cause performance issues.
   From the diagram, if the Spark interpreter needs to access a resource from the JDBC interpreter, the data first has to be sent to the Zeppelin server, which then sends it on to the Spark interpreter. This route introduces extra overhead, and the Zeppelin server becomes a bottleneck requiring large memory when many resources are shared across users/interpreters. So I would suggest the following approach: the Zeppelin server controls only the metadata and ACLs of resources, and the Spark interpreter fetches the data from the JDBC interpreter directly instead of going through the Zeppelin server. Here is the sequence:
       1) The SparkInterpreter asks for the metadata and a token for the resource.
       2) The Zeppelin server checks whether this SparkInterpreter has permission to access the resource; if yes, it sends the metadata and token to the SparkInterpreter. The metadata includes the RPC address of the JdbcInterpreter; the token is for security.
       3) The SparkInterpreter asks the JdbcInterpreter for the resource using the token and metadata received in step 2.
       4) The JdbcInterpreter verifies the token and sends the data to the SparkInterpreter.
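To make the sequence concrete, here is a minimal Python sketch of that handshake. All class and method names below are hypothetical illustrations, not Zeppelin's actual RPC layer (which is Thrift-based and differs substantially):

```python
import secrets

class ZeppelinServer:
    """Holds only metadata and ACLs; never the table data itself."""
    def __init__(self):
        self.resources = {}  # name -> {"owner_addr": str, "allowed": set}
        self.tokens = {}     # one-time token -> resource name

    def register(self, name, owner_addr, allowed):
        self.resources[name] = {"owner_addr": owner_addr, "allowed": set(allowed)}

    def request_access(self, requester, name):
        """Steps 1-2: check the ACL, then hand back metadata plus a token."""
        meta = self.resources[name]
        if requester not in meta["allowed"]:
            raise PermissionError(f"{requester} cannot access {name}")
        token = secrets.token_hex(8)
        self.tokens[token] = name
        return {"rpc_address": meta["owner_addr"], "token": token}

class JdbcInterpreter:
    """Owns the data; verifies the token before serving it (steps 3-4)."""
    def __init__(self, server, tables):
        self.server = server
        self.tables = tables

    def fetch(self, token, name):
        # pop() makes the token one-time; an invalid/reused token is rejected
        if self.server.tokens.pop(token, None) != name:
            raise PermissionError("invalid token")
        return self.tables[name]

# Steps 1-4: the Spark side obtains a token, then fetches directly from JDBC,
# so the table rows never pass through the Zeppelin server.
server = ZeppelinServer()
jdbc = JdbcInterpreter(server, {"sales": [("a", 1), ("b", 2)]})
server.register("sales", owner_addr="jdbc:9001", allowed={"spark"})
grant = server.request_access("spark", "sales")
rows = jdbc.fetch(grant["token"], "sales")
print(rows)
```

The server only ever stores resource names, addresses, and tokens, so its memory footprint stays independent of table size, which is the point of the proposal.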




Re: [DISCUSSION] Extending TableData API

Park Hoon
@Jeff, Thanks for sharing your opinions and important questions.


> Q1. What does resource registration mean? IIUC, currently it means the data is cached in the interpreter process. Then it might become a memory issue as more and more resources are registered. Maybe we could introduce a resource retention mechanism, or cache the data in other forms (like the Spark table cache policy, where the user can specify how to cache the data: memory, disk, etc.).

A1. It depends on the implementation of TableData in each interpreter. For example, if the JDBC interpreter keeps only the SQL from a paragraph to reproduce the table, we don’t need to persist the whole table data in memory, a file system, or external storage. That’s what Section 3.2 describes.
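As an illustration of that idea (the names below are hypothetical, not the actual Zeppelin classes), a TableData implementation can persist nothing but the query and reproduce the rows on demand:

```python
from abc import ABC, abstractmethod

class TableData(ABC):
    """Abstract table resource: each implementation decides what to persist."""
    @abstractmethod
    def rows(self):
        ...

class InterpreterResultTableData(TableData):
    """Keeps the materialized rows in the interpreter process's memory."""
    def __init__(self, rows):
        self._rows = rows
    def rows(self):
        return self._rows

class JdbcTableData(TableData):
    """Keeps only the SQL; re-runs it against the source to reproduce the table."""
    def __init__(self, sql, run_query):
        self.sql = sql                 # the only state that is persisted
        self._run_query = run_query    # callable standing in for a JDBC connection
    def rows(self):
        return self._run_query(self.sql)  # reproduced on demand, nothing cached

# A fake query runner standing in for a real database.
fake_db = {"SELECT * FROM t": [(1, "x"), (2, "y")]}
td = JdbcTableData("SELECT * FROM t", run_query=fake_db.__getitem__)
print(td.rows())
```

The memory cost of a `JdbcTableData` is a single SQL string, regardless of how large the result set is; the trade-off is that each access re-runs the query.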





> Q2. The scope of resource sharing. For now, it seems resources are globally shared. But I think user-level sharing might be more common. Then we need to create a namespace for each user, which means the same resource name could exist in different user namespaces.

A2. Regarding the namespace concept, the proposal only describes what the table resource name should be (Section 5.3), not namespaces.

The namespace could be the name of a note, or something custom (e.g. per-user namespaces). We can discuss this.

Personally, +1 for having namespaces, because they are helpful for searching and sharing. This might be handled by `ResourceRegistry`.
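A minimal sketch of what a namespaced registry could look like (hypothetical Python names for illustration; the real `ResourceRegistry` is Java and may differ):

```python
class ResourceRegistry:
    """Registry keyed by (namespace, name), so the same resource name can
    exist under different user (or note) namespaces without collision."""
    def __init__(self):
        self._store = {}

    def put(self, namespace, name, resource):
        self._store[(namespace, name)] = resource

    def get(self, namespace, name):
        return self._store[(namespace, name)]

    def list(self, namespace):
        """All resource names registered under one namespace."""
        return [n for (ns, n) in self._store if ns == namespace]

reg = ResourceRegistry()
reg.put("user:alice", "sales", "<alice's table>")
reg.put("user:bob", "sales", "<bob's table>")  # same name, different namespace
print(reg.get("user:alice", "sales"))
print(reg.list("user:bob"))
```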


 

> Q3. The data route might cause performance issues. From the diagram, if the Spark interpreter needs to access a resource from the JDBC interpreter, the data first has to be sent to the Zeppelin server, which then sends it on to the Spark interpreter. This route introduces extra overhead, and the Zeppelin server becomes a bottleneck requiring large memory when many resources are shared across users/interpreters. So I would suggest the following approach: the Zeppelin server controls only the metadata and ACLs of resources, and the Spark interpreter fetches the data from the JDBC interpreter directly. Here is the sequence:
>        1) The SparkInterpreter asks for the metadata and a token for the resource.
>        2) The Zeppelin server checks whether this SparkInterpreter has permission to access the resource; if yes, it sends the metadata and token to the SparkInterpreter. The metadata includes the RPC address of the JdbcInterpreter; the token is for security.
>        3) The SparkInterpreter asks the JdbcInterpreter for the resource using the token and metadata received in step 2.
>        4) The JdbcInterpreter verifies the token and sends the data to the SparkInterpreter.

A3. +1 for the Spark interpreter accessing JDBC directly, since that's better for large-data handling. But I'm not sure how other interpreters can do the same thing (a trivial example, but consider the shell interpreter, which keeps its table data in memory).




Some people might wonder why we don't use external storage to persist (large) table resources, instead of keeping them in the memory of the ZeppelinServer.

The authors originally discussed whether to have external storage. But external storage

- requires (lots of) additional dependencies (Geode? Redis? HDFS? Which one should we use, or should we support them all?)
- still might not let us persist 400GB or 10TB of data.

Thus, the proposal was written to

- utilize the interpreter's own storage (e.g. the Spark cluster for the Spark interpreter)
- keep the minimal information needed to reproduce the table result (e.g. keeping only the query), without depending on external storage at first.


Now that we are discussing it, I hope we can improve the proposal and turn it into a real implementation soon. :)



Thanks.






Re: [DISCUSSION] Extending TableData API

Jeff Zhang
>>> But not sure about how other interpreters can do the same thing. (e.g.
trivial, but let's think about the shell interpreter, which keeps its
table data in memory)

The approach I proposed is general across all interpreters. All we need to
do is add one method to RemoteInterpreterProcess for other interpreters to
fetch resources.

>>> Some people might wonder why we do not use external storages to persist
(large) table resources instead of keeping them in memory of ZeppelinServer.

It is fine to use memory for now, but we should leave an interface for
other storage backends. For now we could have just a MemoryStorage, and
add other implementations in the future.
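A sketch of such a pluggable storage interface (hypothetical names, in Python for brevity; a real implementation would be Java inside Zeppelin):

```python
from abc import ABC, abstractmethod

class ResourceStorage(ABC):
    """Storage interface, so backends (disk, HDFS, ...) can be swapped in
    later without touching the callers."""
    @abstractmethod
    def save(self, name, data):
        ...
    @abstractmethod
    def load(self, name):
        ...

class MemoryStorage(ResourceStorage):
    """The only implementation for now: a plain in-memory dict."""
    def __init__(self):
        self._data = {}
    def save(self, name, data):
        self._data[name] = data
    def load(self, name):
        return self._data[name]

# Callers program against the interface, not the concrete backend.
storage: ResourceStorage = MemoryStorage()
storage.save("sales", [(1, "x")])
print(storage.load("sales"))
```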

